NPS Inventory and Monitoring Division Advanced R Training: February 5 – February 8, 2022

Prep for Training

Installing required software

The only thing you really need to do to prepare for R training is to install the latest versions of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.

NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!

Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
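
If you’re not sure which version you currently have, you can check from the R console (this uses only base R, so no packages are needed):

```r
# Print the full version string of the running R installation
R.version.string

# Programmatic check against a minimum version
getRversion() >= "4.1.0"
```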

Install the latest version of R (4.1.2 as of January 2022) Assuming you’re on a PC, the installer for the latest version of R can be downloaded from The Comprehensive R Archive Network (CRAN). You’ll want to download the R installer by clicking on “Download R 4.1.2 for Windows”, or whatever the latest version number happens to be. After it downloads, open and run the installer with default settings. Note that there’s no need to pin R to your taskbar or add a shortcut to your desktop. For 99% of your needs, you’ll run R within RStudio, and won’t ever open R directly.


Install RStudio The installer for RStudio for PCs can be downloaded here. You’ll need to click on the large blue “DOWNLOAD RSTUDIO FOR WINDOWS” button to download the installer. After it downloads, open and run the installer with default settings. I like having RStudio pinned to my taskbar, so it’s easier to find/open, but it’s up to you and your preferences.


Day 1 Requirements:

The following packages will be used in Day 1’s session on data retrieval. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.

packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
              "GADMTools", "sf", "tmap", # GIS in R
              "dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
              "rFIA" # for USFS FIA data
              )

install.packages(setdiff(packages, rownames(installed.packages())))  

If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code, and then reinstall tmap.

install.packages('terra', repos='https://rspatial.r-universe.dev')
install.packages('tmap')

After you run the code chunk above, please run the following code to ensure everything was successfully installed.

packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", "GADMTools", "sf", "tmap", 
              "dataRetrieval", "lubridate", "jsonlite", "httr", "rFIA")

installed_packages <- packages %in% rownames(installed.packages()) # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 

If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.


Day 2 Requirements:

The following packages will be used in Day 2’s session on functional programming. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.

packages <- c("tidyverse", "parallel", "microbenchmark")

install.packages(setdiff(packages, rownames(installed.packages())))  

After you run the code chunk above, please run the following code to ensure everything was successfully installed.

packages <- c("tidyverse", "parallel", "microbenchmark")

installed_packages <- packages %in% rownames(installed.packages()) # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 

If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.


Day 3 Installs: tinytex

If you are attending Day 3: R Markdown, you must have a LaTeX engine installed to output to PDF. The easiest, lightest-weight option is tinytex. The tinytex website includes download files and installation instructions.

The following packages will also be used in Day 3’s session on R Markdown. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.

packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT")

install.packages(setdiff(packages, rownames(installed.packages())))  

After you run the code chunk above, please run the following code to ensure everything was successfully installed.

packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT")

installed_packages <- packages %in% rownames(installed.packages()) # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 

If you want to be able to collapse rows in a kable(), you’ll need the development version of kableExtra, which can be installed from GitHub. However, to install from GitHub, you need the devtools package. See Day 4 below for instructions on installing devtools and RTools (required for devtools). After installing devtools and RTools, you can install the development version of kableExtra with the following code:

devtools::install_github("haozhu233/kableExtra")

If you have any issues making this work, contact one of the trainers prior to training, and we’ll help you troubleshoot.


Day 4 Installs: Git for Windows, Rtools, devtools, and roxygen2

If you are attending Day 4: R Packages and Version Control, you need to install Git for Windows, RTools, and a number of packages prior to training. You’ll also need a GitHub account and to connect RStudio and GitHub prior to training for the best experience. More details to prepare for this session are below:

  1. Download the latest 64-bit Git for Windows by clicking on the “Click here to download” link at the top, and installing the file. Once installed, RStudio typically can find it. Note that Git for Windows is on the approved software list, and you may or may not require Admin privileges to install it. To spare your IT folks unnecessary work, first try installing Git for Windows yourself to see whether you can do it on your own.
  2. Download files and installation instructions for Rtools are available on CRAN’s RTools page. It’s a large file and may require admin privileges to install, so be sure to install prior to training. You must also be using R 4.0 or higher for this training, so be sure to download Rtools4.
  3. After you install Rtools, install packages listed in the code chunk below. Note that the devtools package allows you to install packages from GitHub, and will be the easiest way for others to install your packages in their R environment. The devtools package has a lot of dependencies, and you often have to install new packages or update existing packages during the process. If you’re asked to update libraries while trying to install devtools, you should update them. The most common reason the devtools install doesn’t work is because you have an outdated version of one of its dependencies installed.
  4. The following packages will be used in Day 4’s session on R packages and version control. Please run the code chunk below to install all of the required packages that are not currently installed on your machine.

    packages <- c("devtools", "usethis", "roxygen2","stringr", "dplyr")
    
    install.packages(setdiff(packages, rownames(installed.packages())))  

    These packages can take a while to download, and sometimes time out before finishing. After you run the code chunk above, please run the following code to ensure everything was successfully installed.

    packages <- c("devtools", "usethis", "roxygen2", "stringr", "dplyr")
    
    installed_packages <- packages %in% rownames(installed.packages()) # check which packages are installed
    if (length(packages[!installed_packages]) > 0){
      install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them
    
    lapply(packages, library, character.only = TRUE) 
  5. Finally, please complete the steps in the tab Day 4: Version Control > Git and RStudio to create a GitHub account and connect it to RStudio.
  6. If you have any issues making this work, contact Kate Miller or Sarah Wright prior to training, and we’ll help you troubleshoot.


Structure of training

The entire training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. At least one trainer will also hold office hours during the hour before and the hour after each session, in case there are questions that couldn’t be handled during training.

Each training session has three trainers assigned, two of whom will be the main instructors; the third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the shared screen on one monitor and your own session of RStudio open on the other.

Half days are barely enough to scratch the surface of the more advanced topics we’re covering in R. Our goal with this training is to help you get beyond the initial learning curve, so you’re armed with the skills you need to continue learning on your own. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R, and it has greatly increased our efficiency and productivity in our jobs. We hope you have a similar experience as you begin applying and learning more about the tools we’re covering this week. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.

Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.


Day 1: Data Retrieval

Reading Data from Databases

Reading Data from Microsoft Access

library(dplyr) # for mutate_if() and the %>% pipe; DBI and odbc are called with :: below

#----------Connect to Access------------#

db_path <- "path/to/database.mdb"
conn_string <- paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)

conn <- DBI::dbConnect(odbc::odbc(), .connection_string = conn_string)


#---------------Read the data---------------#

data <- dplyr::tbl(conn, "tbl_MyAccessTable") %>%
  collect()

# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
  mutate_if(is.character, dplyr::na_if, "")

# Close the connection as soon as you no longer need it, otherwise you will get weird errors
DBI::dbDisconnect(conn)

Reading Data from a SQL Database
library(DBI)
library(odbc)
library(readr)
library(magrittr)
library(dplyr)
library(dbplyr)

#----------Option 1: Hard code db connection info (not great)------------#

sql_driver <- "SQL Server Native Client 11.0"
my_server <- "SERVER\\NAME"
my_database <- "Database_Name"

conn <- dbConnect(Driver = sql_driver, Server = my_server, Database = my_database, Trusted_Connection = "Yes", drv = odbc::odbc())


#----------Option 2: Read db connection info from shared drive------------#

# This is just a csv with columns for Driver, Server, and Database
params <- readr::read_csv("path/to/csv/on/shared/drive/stlk-database-conn.csv")
params$Trusted_Connection <- "Yes"
params$drv <- odbc::odbc()

conn <- do.call(dbConnect, params)


#----------Option 3: Create a user DSN (easy to use in R, but has to be configured for each user on each computer)------------#

# Create DSN: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-how-to/configuring-the-sql-server-odbc-driver-add-a-data-source?view=sql-server-ver15
dsn_name <- "myDSN"
conn <- dbConnect(odbc(), dsn_name)


#---------------Read the data---------------#

data <- dplyr::tbl(conn, dbplyr::in_schema("analysis", "Chemistry")) %>%  # The first argument to in_schema is the schema name (SQL default is dbo) and the second is the table name.
  collect()

# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
  mutate_if(is.character, dplyr::na_if, "")

# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)

GIS in R

Packages used in this section:

library(GADMTools) # for downloading spatial data
library(sf) # for working with simple features
library(tmap) # for mapping spatial data

If folks are having trouble installing tmap due to an issue with one of its dependencies, terra, try running the following code, and then reinstall tmap.

install.packages('terra', repos='https://rspatial.r-universe.dev')

Background

If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding conventions, and rarely did a single package have all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and finishing in ArcGIS… Enter the sf and tmap packages, the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved by these two.

The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R, and that most GIS-related packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the rgdal package, which used to be the more common data model in R and open source GIS. The more I use sf, the more impressed I am with how intuitive it is to use, and how well documented its functions are. For vector data, I have yet to find a task that sf couldn’t do.
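
As a small taste of how simple features work (a sketch with made-up coordinates, not part of the training data), you can build an sf object from a plain data frame of coordinates and reproject it in a couple of lines:

```r
library(sf)

# Hypothetical coordinates; any lon/lat pairs work the same way
pts <- data.frame(site = c("A", "B"), lon = c(-68.3, -68.2), lat = c(44.3, 44.4))

# Convert to a simple feature, declaring the coordinates as WGS84 (EPSG:4326)
pts_sf <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# Reproject to Albers Equal Area (EPSG:5070); attribute columns come along for free
pts_albers <- st_transform(pts_sf, 5070)
st_crs(pts_albers)$epsg
```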

The main application of the tmap package is making maps, and it was designed with a grammar of graphics philosophy similar to ggplot2. In practice, this means that maps are built as a collection of many small functions/layers that get added together with +, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.

The raster package is also an excellent package for analyzing/processing raster data. For large jobs, I find the raster package easier to work with than ESRI tools, and it tends to run a lot faster than ESRI built-in tools (I haven’t compared with python).

Finally, the leaflet package in R allows you to create interactive maps as html widgets. These are often included in R shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection the map will show the default extent defined by the code.

The leaflet package is relatively easy to use for basic mapping. For more advanced features or to customize it further, you often end up having to code in JavaScript, which is the language leaflet was originally programmed in. There’s also a lot more online help available for the JavaScript version of the leaflet library than the R version. If you’re really stumped about something, you may find your answer with the JavaScript help.
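
As a minimal illustration of that basic usage (this sketch just drops a marker on a default basemap; the coordinates are arbitrary), a leaflet map is built by piping layers together:

```r
library(leaflet)

# Build an interactive map: default OpenStreetMap basemap plus one hypothetical marker
m <- leaflet() %>%
  addTiles() %>%                                  # default basemap
  setView(lng = -106.27, lat = 35.78, zoom = 10) %>% # initial extent
  addMarkers(lng = -106.27, lat = 35.78, popup = "Example site")

m # prints in RStudio's Viewer, or embeds in R Markdown HTML output
```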


Downloading administrative boundaries

There’s an open source group that hosts a database of geographic boundaries (called GADM), and you can download their data directly in R. Multiple packages can do this, including GADMTools and geodata. The maintainer of geodata also maintains the raster package, so it’s a solid package. I’m going to show GADMTools, because I found it easier to download and import as simple features, but geodata is perfectly acceptable too.

The code below performs the following steps for each level of GADM available:
  1. Downloads the specified country’s data at the given level from the GADM server as an .Rds file
  2. Reads the downloaded .Rds file into the R session
  3. Converts the object to a simple feature and reprojects it from WGS84 (EPSG:4326) to Contiguous US Albers Equal Area (EPSG:5070), a projection with units in meters. Obviously you’d want to use whatever projection your other data are in.

library(GADMTools)
library(sf) # for st_as_sf(), st_transform(), and st_write()

# Level 0: National
gadm_sf_loadCountries("USA", level = 0, basefile = "./data/")
us_bound <- readRDS("./data/USA_adm0.sf.rds")
us_bound_utm <- st_transform(st_as_sf(us_bound, crs = 4326), 5070)
st_write(us_bound_utm, "./data/US_Boundary.shp", append = F)

# Level 1: State/Province
gadm_sf_loadCountries("USA", level = 1, basefile = "./data/")
us_states <- readRDS("./data/USA_adm1.sf.rds")
us_state_utm <- st_transform(st_as_sf(us_states, crs = 4326), 5070)
st_write(us_state_utm, "./data/US_States.shp", append = F)

# Level 2: County/district
gadm_sf_loadCountries("USA", level = 2, basefile = "./data/")
us_cnty <- readRDS("./data/USA_adm2.sf.rds")
us_cnty_utm <- st_transform(st_as_sf(us_cnty, crs = 4326), 5070)
st_write(us_cnty_utm, "./data/US_Counties.shp", append = F)
plot(us_state_utm[1])


NPS Park tiles

The urls below are the different park tiles available, which can be plotted in the background of leaflet or tmap maps. I’ll show how that works with an example using tmap.

# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'

Using the us_cnty_utm layer, which we downloaded from GADM and then reprojected to EPSG:5070 (Albers), we’ll make a quick map to show how tmap works. First I’ll filter the full county layer to just the state of Minnesota. Then I’ll make a map with the MN layer called mn_map, which first adds the mn_cnty layer via tm_shape(), then specifies how the layer should look using tm_borders(). Finally, we use tmap_options() to set the basemaps of the map, which are the park tiles that were defined above.

# Map 
mn_cnty <- st_as_sf(us_cnty_utm[us_cnty_utm$NAME_1 == "Minnesota",])

mn_map <- tm_shape(mn_cnty, projection = 5070) +
          tm_borders(col = "black")

tmap_options(basemaps = c(Map = NPSbasic,
                          Imagery = NPSimagery,
                          Light = NPSlight,
                          Slate = NPSslate))

Now we’ll change the mode to make the plot interactive.

tmap_mode('view') # set to interactive mode
mn_map
tmap_mode('plot') # return to static mode

Web Services

Coming soon…

Downloading FIA data

Download FIA Manually

You can download USFS FIA data in R using their DataStore URL. The code below downloads each state’s PLOT table, prints the number of rows (which you’ll want to check against the DataStore records), and saves them to a data folder on my hard drive.

library(purrr)

# Create function to iterate the download
downFIA <- function(state, table){
  csv_name <- paste(state, table, sep = "_")
  csv_link <- paste0('http://apps.fs.usda.gov/fia/datamart/CSV/', csv_name, '.csv')
  dat <- read.csv(csv_link)
  print(nrow(dat)) # check the row count against the DataStore records
  write.csv(dat, paste0("./data/", csv_name, '.csv'), row.names = FALSE)
}

# Create list of states of interest
states = c("CT", "RI", "DE")

# Use purrr::map to download each state's PLOT table.
map(states, ~downFIA(., "PLOT"))

Note that you can take this a step further with purrr::map2() to download from a list of states and a list of tables. But note that these are big files, and the process may crash if you try to do too much in one call.
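
A sketch of that map2() approach, assuming the downFIA() function from the chunk above is already defined (the download line is commented out here because the files are large):

```r
# Build every state x table combination with base R
combos <- expand.grid(state = c("CT", "RI", "DE"),
                      table = c("PLOT", "TREE"),
                      stringsAsFactors = FALSE)

# Walk the paired columns; uncomment to actually download
# purrr::map2(combos$state, combos$table, downFIA)
combos
```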

Download with rFIA

The rFIA package was recently developed through a cooperative agreement between NETN and Hunter Stanke, a grad student at Michigan State University. The package was designed to make it easier to access and import FIA data into R and to perform common queries and analyses for specified areas and time series with the FIA data.

Load library and download data

FIA datasets are huge and can take a while to download. For our purposes here, we’re going to download DE’s and RI’s data, since they’re the smallest.

library(rFIA)

# Download DE and RI state data tables to data folder
getFIA(states = c('DE', 'RI'), dir = "./data")

# Download reference tables (ie lookup tables)
getFIA(states = 'ref', tables = c('SPECIES', 'PLANT_DICTIONARY'))

Filtering the data

The code below will import database objects into your global environment. To see the names of the tables within the database object, you can just write names(fia_all). To view a specific table, you can write fia_all$PLOT.

# Import all states in R that's in the data folder
fia_all <- readFIA("./data")
names(fia_all) # Print table names
table(fia_all$PLOT$STATECD) # View number of plots by state code. Should show 2 states.
table(fia_all$PLOT$INVYR) # Range of inventory years in the data

# Import RI data only
fia_ri <- readFIA("./data", states = 'RI')
names(fia_ri)
head(fia_ri$PLOT)
table(fia_ri$PLOT$STATECD) # View number of plots by state code. Now only 1.
table(fia_ri$PLOT$INVYR)

Another really useful feature of the rFIA package is that you can clip the data to a time period or a county. The code below clips the FIA data for RI to the most recent survey of each plot in their sample design.

# Clip all data to most recent sample
all_recent <- clipFIA(fia_all, mostRecent = TRUE)

# Print list of counties in RI (countiesRI is a dataset included with rFIA)
countiesRI

# Clip RI to Washington County
ri_Wash <- clipFIA(fia_ri, mask = countiesRI[5,], mostRecent = F)

You can also calculate common summary statistics and sampling errors using other functions within rFIA, including carbon, biomass, diversity, down woody material, seedling density, tree population estimates, tree growth and mortality, etc. I’ll show this for the tree population estimates using the tpa() function. These functions also allow you to summarize or group at multiple levels, including the state level, the plot level, the subplot level and the tree level. You can also group by species level and size class using this function. Grouping by ownership is also possible, which allows federal lands to be summarized differently than private lands. The grpBy argument also allows you to specify multiple grouping variables.

fia_ri_rec <- clipFIA(fia_ri, mostRecent = TRUE)

# State level tree population estimates
tpa_ri <- tpa(fia_ri)

# Plot level tree population estimates for most recent survey
tpa_ri_plot <- tpa(fia_ri_rec, byPlot = TRUE)

# Species level and size class level tree population estimates
tpa_ri_sppsz <- tpa(fia_ri_rec, bySpecies = TRUE, bySizeClass = TRUE)

# by Ownership
tpa_ri_own <- tpa(fia_ri_rec, grpBy = OWNGRPCD)

The package also has some built-in plotting functions that help you visualize the data created by the previous step. The plot below is the trees per acre for the state of RI over time.

head(tpa_ri)
## # A tibble: 6 x 8
##    YEAR   TPA   BAA TPA_SE BAA_SE nPlots_TREE nPlots_AREA     N
##   <dbl> <dbl> <dbl>  <dbl>  <dbl>       <int>       <int> <int>
## 1  2005  516.  112.  10.6    3.89          66          67    95
## 2  2006  496.  111.   8.85   3.23          93          95   140
## 3  2007  501.  112.   6.89   3.29         125         126   198
## 4  2008  499.  115.   6.86   3.36         121         122   193
## 5  2009  516.  116.   6.86   3.21         116         118   191
## 6  2010  504.  118.   6.95   3.19         118         119   196
plotFIA(tpa_ri, y = TPA, plot.title = "Trees per acre")

Aquarius and NWIS Water Data

The goal of this demonstration will be to retrieve both an NPS Aquarius continuous dataset and a USGS NWIS continuous dataset through R, then combine them to do some basic plotting.

The code required will need to be saved to a directory in your project directory or (more easily) accessed from my GitHub. The source code is Aquatic Informatics’, but I have made some modifications and added some basic tools to streamline the process. The required code can be accessed via the “AndrewBirchHydro/albAquariusTools” GitHub repository, but should be imported from GitHub in the first code chunk.

Retrieving Water Quality Data from Aquarius

First, make sure you are connected to the VPN. Then use the chunk below to import the required tools and connect to Aquarius:

#the following toolbox can be pulled directly from my github:
#source("Aquarius basics.R") or....
source("https://raw.githubusercontent.com/AndrewBirchHydro/albAquariusTools/main/Aquarius%20basics.R")
timeseries$connect("https://aquarius.nps.gov/aquarius", "aqreadonly", "aqreadonly")
publishapiurl='https://aquarius.nps.gov/aquarius/Publish/v2'

#we also will need to load a few packages. If you do not have these, you will need to install them in the console.
library(dataRetrieval)
library(lubridate)
library(ggplot2)
library(jsonlite)
library(httr)

After running the chunk above, you should be connected and can now access NPS Aquarius stations and data.

By assigning the variable Network below, you can get a list of sites within a given network.

Let’s start by pulling all locations in SCPN:

#assign the network name of interest
Network <- "Southern Colorado Plateau Network"

#this function will print the available locations in that network
Print_Locations(Network = Network)

#if you want a dataframe of the locations, you can assign it to 'Locations_in_network' here:
Locations_in_Network <- as.data.frame(Print_Locations(Network = Network))

Once you have selected a location of interest, you can use the Print_datasets() function to get information on all of the available time series at that location.

For this exercise, we will go with the location Rito de los Frijoles at BAND Visitor Center, which has the identifier ‘BANDRIT01’.

Entering this location identifier into the Print_datasets() function (easiest to copy and paste from the table above) will provide a list of datasets available at this location:

#enter the location of interest to get a list of datasets:
Print_datasets(Location = "BANDRIT01")

#you can also print available metadata for a given location based on its identifier:
Print_metadata(Location = "BANDRIT01")

In the table above, the available datasets for Rito de los Frijoles at BAND Visitor Center include air temperature, water temperature, and voltage from their respective instruments. Let’s take a look at the water temperature record.

Using the Get_timeseries2() function, we will retrieve this dataset and assign it to the variable raw_record. We can then use the TS_simplify() function to do some basic data wrangling and convert it to a cleaned up dataframe.

Enter the identifier for the dataset below to retrieve the raw record, then use the function TS_simplify() to convert it to a usable dataframe:

#this line will pull the dataset from Aquarius
raw_record <- Get_timeseries2(record = "Water Temp.AquaticLogger@BANDRIT01")

#this function will "clean it up"
temp_data <- TS_simplify(data = raw_record)

We should now have a simplified continuous time series of water temperature at Rito de los Frijoles at BAND Visitor Center, with columns for the date-time and the temperature value.

Let’s take a quick look at the dataset:

# some summary information
head(temp_data)
##             date_time  value
## 1 2007-12-05 09:00:00 12.292
## 2 2007-12-05 09:15:00 12.678
## 3 2007-12-05 09:30:00 13.257
## 4 2007-12-05 09:45:00 13.882
## 5 2007-12-05 10:00:00 14.218
## 6 2007-12-05 10:15:00 14.170
summary(temp_data)
##    date_time                       value        
##  Min.   :2007-12-05 09:00:00   Min.   :-11.686  
##  1st Qu.:2011-06-04 23:41:15   1st Qu.:  3.367  
##  Median :2014-03-25 00:52:30   Median :  9.176  
##  Mean   :2014-09-26 14:14:41   Mean   :  9.882  
##  3rd Qu.:2018-05-16 18:33:45   3rd Qu.: 15.748  
##  Max.   :2021-10-29 14:00:00   Max.   : 40.761  
##  NA's   :48
# a quick plot
ggplot(temp_data, aes(x = date_time, y = value)) +
  geom_path() +
  theme_bw() +
  xlab("Time") +
  ylab("Water Temperature (C)")


Retrieving Discharge data from NWIS

Let’s say we want to pair this temperature dataset with discharge for an analysis.

There were no discharge data available in Aquarius for this site, but after looking at a map, there is a nearby NWIS station at RITO DE LOS FRIJOLES In BANDELIER T MON, NM. The station’s ID number is 08313350.

We can pull discharge data from NWIS to pair with the temperature record using the dataRetrieval package.

The package includes a function, readNWISuv(), which we will use to grab the discharge data. It requires variables for the site number, the USGS parameter code, and start/end dates, which we will assign below (the USGS code for discharge is 00060).

#00060- USGS code for discharge

siteNo <- "08313350"
parameter <- "00060"

#for start and end dates, you can enter basically all of known time to get whatever data
#are available, or narrow it down to a particular window. In this instance, let's go from
#1900 to 2025 to ensure we pull everything available
start.date <- "1900-01-01"
end.date <- "2025-01-01"

With the required variables defined, we can now use the readNWISuv() function to pull our discharge dataset (it may take a few seconds if it is a long record):

NWIS_query <- readNWISuv(siteNumbers = siteNo,
                     parameterCd = parameter,
                     startDate = start.date,
                     endDate = end.date)

The dataframe will come in with some goofy USGS formatting and column names, which can be reassigned easily to match our Aquarius temperature record:

colnames(NWIS_query) <- c("agency", "identifier", "date_time", "Q_cfs", "time_zone")

Let’s take a quick look at the discharge data:

head(NWIS_query)
##   agency identifier           date_time Q_cfs time_zone  NA
## 1   USGS   08313350 1993-10-01 06:15:00  0.72    A [91] UTC
## 2   USGS   08313350 1993-10-01 06:30:00  0.72    A [91] UTC
## 3   USGS   08313350 1993-10-01 06:45:00  0.72    A [91] UTC
## 4   USGS   08313350 1993-10-01 07:00:00  0.72    A [91] UTC
## 5   USGS   08313350 1993-10-01 07:15:00  0.72    A [91] UTC
## 6   USGS   08313350 1993-10-01 07:30:00  0.72    A [91] UTC
summary(NWIS_query)
##     agency           identifier          date_time                  
##  Length:383517      Length:383517      Min.   :1993-10-01 06:15:00  
##  Class :character   Class :character   1st Qu.:1996-02-26 23:15:00  
##  Mode  :character   Mode  :character   Median :2013-03-26 22:20:00  
##                                        Mean   :2006-12-22 17:29:37  
##                                        3rd Qu.:2015-06-17 00:20:00  
##                                        Max.   :2016-10-30 05:55:00  
##      Q_cfs           time_zone              NA           
##  Min.   :   0.000   Length:383517      Length:383517     
##  1st Qu.:   0.500   Class :character   Class :character  
##  Median :   0.820   Mode  :character   Mode  :character  
##  Mean   :   1.577                                        
##  3rd Qu.:   1.180                                        
##  Max.   :9500.000
# we'll plot both the water temperature and discharge records for a quick visual comparison
ggplot(data = NWIS_query, aes(x = date_time, y = Q_cfs)) +
  geom_path() +
  theme_bw() +
  scale_y_log10() +
  xlab("Time") +
  ylab("Discharge (cfs)")

ggplot(temp_data, aes(x = date_time, y = value)) +
  geom_path() +
  theme_bw() +
  xlab("Time") +
  ylab("Water Temperature (C)")

Uh oh! Taking a look at the data above, there is a significant gap in the discharge record from this station from around 1997 to 2013. Gaps like this are fairly common, which emphasizes the need to always visually inspect data before conducting any kind of analysis.

There is some overlap between the two records, however, so let's merge them by date_time to get a combined time series covering the period where the records overlap.

# we will enter FALSE for all.x and all.y, as we only want the period where the records overlap one another:
combined_record <- merge(x = NWIS_query, y = temp_data, 
                         by = "date_time", all.x = F, all.y = F)

We should now have a record of both water temperature from Aquarius and discharge from NWIS. Let's take a quick look at it:

head(combined_record)
##             date_time agency identifier Q_cfs quality_cd time_zone value
## 1 2012-07-05 06:00:00   USGS   08313350  0.43          A       UTC  17.4
## 2 2012-07-05 06:15:00   USGS   08313350  0.40          A       UTC  17.3
## 3 2012-07-05 06:30:00   USGS   08313350  0.40          A       UTC  17.3
## 4 2012-07-05 06:45:00   USGS   08313350  0.40          A       UTC  17.2
## 5 2012-07-05 07:00:00   USGS   08313350  0.43          A       UTC  17.1
## 6 2012-07-05 07:15:00   USGS   08313350  0.40          A       UTC  17.1
summary(combined_record)
##    date_time                      agency           identifier       
##  Min.   :2012-07-05 06:00:00   Length:55844       Length:55844      
##  1st Qu.:2013-06-06 23:56:15   Class :character   Class :character  
##  Median :2014-09-05 15:07:30   Mode  :character   Mode  :character  
##  Mean   :2014-10-04 03:29:06                                        
##  3rd Qu.:2016-04-10 16:37:30                                        
##  Max.   :2016-10-30 05:45:00                                        
##      Q_cfs          quality_cd         time_zone             value       
##  Min.   :   0.000   Length:55844       Length:55844       Min.   : 0.121  
##  1st Qu.:   0.370   Class :character   Class :character   1st Qu.: 9.373  
##  Median :   0.670   Mode  :character   Mode  :character   Median :14.709  
##  Mean   :   2.081                                         Mean   :13.933  
##  3rd Qu.:   1.140                                         3rd Qu.:18.140  
##  Max.   :9500.000                                         Max.   :35.542
# Let's take a basic look at the time series together. Note that since they are on different scales, a secondary axis is required
ggplot(combined_record, aes(x = date_time, y = Q_cfs)) +
  geom_path(color = "blue") +
  geom_path(color = "red", aes(x = date_time, y = value * 500)) +
  theme_bw() +
  scale_y_continuous(name = "Discharge (cfs)", sec.axis = sec_axis(~ . / 500, name = "Water Temperature (C)"))

# Let's plot temperature and discharge against one another to see their relationship:
ggplot(combined_record, aes(x = Q_cfs, y = value)) +
  geom_point() +
  theme_bw() +
  xlab("Discharge (cfs)") +
  ylab("Water Temperature (C)") +
  scale_x_log10()

Not much information here, is there? What if we want to see the temperature-discharge relationship by month or year?

Let's assign columns for month and year, and recreate the plot based on those parameters.

# This will require the lubridate package
library(lubridate)

# this will create a new column with the numeric month derived from the date_time column as a factor
combined_record$month <- as.factor(month(combined_record$date_time))
# this will create a new column with the year derived from the date_time column as a factor
combined_record$year <- as.factor(year(combined_record$date_time))

Now let's recreate our plot, using the new month column for color and faceting by year.

ggplot(combined_record, aes(x = Q_cfs, y = value, color = month)) +
  geom_point() +
  theme_bw() +
  xlab("Discharge (cfs)") +
  ylab("Water Temperature (C)") +
  scale_x_log10() +
  facet_grid(year ~ .)


Resources

  • rFIA website which contains background on the package, along with tutorials for using the data.
Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • R Studio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The blog I linked to helped me figure out how to map shapefiles in ggplot.


Day 2: Functional Programming

Functions Used Today

We want to provide you with a summary of functions and operators that we will be using in this section. We suggest using R help to read up on them before class. You don’t have to master them, but be generally aware of them. You can use these help functions to look up the functions and operators below:
    help() - searches help files for packages loaded into namespace
    ? - searches help files for packages loaded into namespace
    ?? - searches all help files regardless of whether or not they are loaded

Base operators and things that look like them. When looking these up, they must be wrapped in quotes: '=' '<-' '==' '!' '!=' '>' '>=' '<' '<=' '%in%' '|' '&'
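As a quick sketch of these lookups (mean and %in% are just arbitrary examples):

```r
?mean            # help for mean() from packages loaded into the namespace
help("mean")     # same as ?mean
??"regression"   # searches help files across ALL installed packages
?"%in%"          # operators must be quoted: '%in%', '<-', '=='
```

The quoting requirement applies to every operator in the list above; ?%in% without quotes is a syntax error.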

base (and packages in base) library() [] rm() function(){} data.frame() for(){} apply() tapply() sapply() lapply() do.call() rbind() names() setNames() colnames() nrow() rpois() quantile() plot() vector() read.csv() class() is.numeric() is.na() mean() if(){}else{} return() glm() all() map_if()

dplyr %>% and . filter() mutate() case_when() bind_rows() select() starts_with() rename() date() group_by() across() summarize() pivot_longer()

tidyselect where()

tidyr unite() pivot_longer()

microbenchmark microbenchmark()

ggplot2 ggplot() geom_point() geom_line() facet_grid() ggsave()

purrr map() map2() map_dbl() discard()

lubridate mdy() mdy_hms()

modelr add_predictions()

Introduction

Functional Programming

I will leave you to Google this and dive into the rabbit hole that is the definition of functional programming; it is a technical term, not just a label for code that works. In short, a “functional” is a higher-order function that takes a function as one of its inputs. For our purposes, “functional programming” will focus on iterative functionals (the apply family and map) and how to make functions that can be passed to functionals.

The goal of functional programming is more stable, transparent, and reliable code.

Overview of functions

This module will provide a look at simple and moderately complex ‘functions’ in R. We will look at how to make a function, and then we will look at how you apply that function in iteration. The goal is to equip you with the two most powerful tools any R user can have: 1) the ability to create a function, and 2) the ability to use that function to get a lot of repetitive tasks done quickly. You already learned a little about the second one in the Day 4 iteration material, where you saw us using some functions.

Although we are calling this the advanced training, the approaches here are intended to get you started, but are not exhaustive. As you practice these fundamentals you will quickly find better ways to get some of the things you want to do done.

What is a function?

Thomas - A function is a container for a series of steps that are performed on data. It has some fixed expectations about input and output - another way of saying that is that a function expects a specific pattern(s) for input and for output. A function is also an expression of what another person thinks the right way to do something is. The nice thing is that if you don’t like all that, you can write your own function, but be selective about that.

JP - Functions are a way of taking a large problem and breaking it down into simpler discrete steps. Each function can just focus on one step and makes it easier to do that step in isolation. You can then reuse the functions to help solve new problems in the future.

In R, nearly everything is a function. But what is a function, really? A function usually has 3 components: the function name, the arguments, and the body or source code.

Hang on, what was that ‘usually’ jazz? Okay, there are ‘named functions’ and ‘anonymous functions.’ The difference is that when you plan on reusing a function, you give it a name (3 components). If you don’t plan on using it ever again, you don’t give it a name (2 components) and it is called an ‘anonymous function.’ I am going to show you examples of both, but not get too hung up on the taxonomy.
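A minimal sketch of the distinction, using a made-up cv() (coefficient of variation) function:

```r
# named function: we give it a name because we plan to reuse it
cv <- function(x){ sd(x) / mean(x) }
cv(c(2, 4, 6))  # 0.5

# anonymous function: defined inline inside a functional and never named,
# because we only need it this once
sapply(list(c(2, 4, 6), c(10, 10, 10)), function(x){ sd(x) / mean(x) })
```

Both versions have arguments and a body; the named one simply adds the third component (a name) so it can be called again later.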

Anatomy of a function

Let’s look at mean(x) as an example:

The name of the function is simply “mean”. Function names should be simple and somewhat intuitive (if your function calculates the data mean but you name it “Pat”, that doesn’t make sense). You should also be careful not to give your function the same name as something that exists in base R or in a package that you might commonly use. This can cause conflicts in your environment and unexpected results. R is pretty good about warning you about this.

The arguments of the function are what you put inside the parentheses. Arguments tell the function what the data are and they tell it how to handle the data. In this case mean(x) is telling the mean function that the data to operate on are x.

Almost all functions have more than one argument; however, most of the time you are only specifying 2 or 3 when you call the function. If you want to know what arguments a function can accept, help() will take you to a (hopefully) useful explanation. There you can see which arguments have defaults, what those defaults are, and when you should think about changing the defaults.
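If you'd rather not open the help page, here is a quick sketch of two ways to inspect a function's arguments and defaults from the console (read.csv is just an arbitrary example):

```r
# args() prints the argument list, with any defaults, in signature form
args(read.csv)

# formals() returns the formal arguments as a list you can poke at directly
names(formals(read.csv))   # just the argument names
formals(read.csv)$header   # the default for one argument (TRUE)
```

This is handy when you only want to remind yourself of an argument name or default without leaving the script you are working in.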

The last bit of a function is the source code (“source” for short). This is what the function does. 95% of the time you can safely ignore the source. However, it is useful to look at the source when you want to understand why a function is doing what it is doing, modify a function, see what arguments it can accept, what the defaults are, etc.

How do you get to the source? These days, for non-base packages, a lot of this can (eventually) be found on GitHub. The fastest way is to type the function name without the () into the console; it will return the underlying source code for that function if it is available.

#run these in the console without the () to see what lies underneath
mean

lm

Some functions show more about what they are doing than others. mean doesn’t have much to show us; that is because it is a compiled function and part of the R source code. If you already have the ability to dig down into compiled functions, you probably don’t need to be in this course. But see reference [1] if you want to try!

When you type in lm you get a lot more information out, and you can see that there is a fair amount of code being used to execute lm.

Creating a Simple Function

When do you need to create a function?

  • When what is available doesn’t do what you want it to do.
  • When you are about to do something more than twice.
  • When you see a repeating pattern in the code you are writing.
  • When you are recycling the same code and changing variable names or values.

Examples:

  • You want to create separate linear models for the relationship between chloride and specific conductance at 25 sites. You could analyze each site as a separate model, or you could write a function that works through each site.
  • You need to rename 250,000 acoustic recorder files every year. You could resign from your job, or you could write a function.
  • What are your examples?

Best Practices: What should your function look like?

There are some rules to keep in mind for your functions before we get into developing them.

  • Avoid choosing names that are not intuitive or are already taken.
  • Argument names can (arguably ‘should’) be the same as parameter or variable names in other functions.
  • One function for each task vs. one function to rule them all? Err on the side of one function for each task, but sometimes you really do want a super function.
  • Make sure your function is interpretable. #Comments accomplish this.
  • Figure out where to draw the line between the function and the functional (iteration).

Steps for function development

Below are some of the steps we find ourselves following when we need to develop functions.

  • Verbalize what you want the function to do, and do some googling.
  • Identify the pattern or patterns in the data your function will need to operate on.
  • Decide what you want the function to output.
  • Within the expected pattern, set up test cases (test data). I like a clear positive case and a negative case.
  • Do some programming.
  • Test the function until you are satisfied (debugging; see Day 4).
  • Apply it to the data.

Modifying a simple function

Let’s create our first function by changing the default behavior of an existing one.

set.seed(12345) #gives everybody the same data

d<-c(floor(runif(100)*100),NA) #generate random data

mean(x=d) #unexpected result

The response is NA, which is not what we want (maybe). It is easy enough to address this with mean(d,na.rm=T), but we may not want to do that many times throughout our code.

mean2<-  #Tell [R] that I want this new function to be named "mean2"
       function(x){  #the function consists of 1 parameter named x (aka the data) The { begins the function source code / expressions. 
                   mean(x,na.rm=T) #in the mean function change the default for na.rm=T
                   } #close function

Now let’s check its behavior.

mean2(x=d) #more expected result

That handled the NA value without giving an error. What if we want to switch that back?

mean2(x=d, na.rm=F)

How you create your function affects what it can use as ‘arguments’.

When we set up our new function we did not tell it na.rm is an argument. We fixed it in the source code. If you want something as a parameter, it must be listed in the parentheses.

mean3<- function(x,na.rm=T){mean(x=x, na.rm=na.rm)} 

mean3(d)

So now we have just made mean with the na.rm set to true and we can change that if needed.

mean4<- function(x,na.rm){#very minor change. I deleted the initial parameter value
                          mean(x=x, na.rm=na.rm)} 
mean4(d)

What do you think is going to happen without the initial parameter value?

It didn’t work… or did it? How you set up your function is partially about what you really intend for it to do. I would argue that none of these are wrong; they each have a different use and set of assumptions:

  • mean - assumes you want an NA if data are missing, but might want to change that behavior.
  • mean2 - assumes you always want to ignore NAs and have no reason to change that behavior.
  • mean3 - assumes you mostly want to ignore NAs but might want to change that behavior.
  • mean4 - assumes nothing and forces you to explicitly state how you are going to handle NA values.

A few final things on the basics of how functions function. You will see the coding for a simple one-line function expressed two ways:

mean5<- function(x,na.rm){mean(x=x, na.rm=na.rm)} #always works

mean5<- function(x,na.rm) mean(x=x, na.rm=na.rm) #only works on one line

If a function can be expressed with a single line you do not need curly brackets. If the function is on more than one line you must use curly brackets. There are some other details on this, but that is all you really need to know. My preference is that you should always use the curly brackets even if you don’t need to.

Functionals

Functionals

I want to come back to what “functionals” are now. For our purposes, a functional is a higher-order function that takes a function and data as its inputs (and maybe a few other things as well). The functional iterates that function over the data in a predefined way and returns a certain output (e.g., vector, list, dataframe). This is one of the reasons why functionals are preferred over for loops: functionals have an expected behavior. With for loops you are defining the behavior; with functionals, that is largely done for you and you are just picking the one (or combination) that does what you want.

Which of the following is a correct statement?

  • The for loop isn’t iterating correctly.
  • The functional isn’t iterating correctly.

Can you think of any reasons why we would want to sacrifice the flexibility of a for loop for the rigidity of a functional?

We are going to use two functional families: apply (base R) and map (purrr). There are more out there, but these two are the easiest to understand. We will build some general approaches in apply and then rebuild them in map.
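As a small base-R warm-up, here is the apply-family idea in miniature (data_list is a throwaway example): each functional takes data plus a function, iterates in a predefined way, and has a predictable output type:

```r
# A toy list of numeric vectors to iterate over
data_list <- list(a = 1:5, b = 6:10)

# lapply() always returns a list, one element per input element
lapply(data_list, mean)

# sapply() tries to simplify the result (here, a named numeric vector)
sapply(data_list, mean)

# vapply() is like sapply() but you declare the output type, which is safer
vapply(data_list, mean, numeric(1))
```

The fixed, documented output type of each functional is exactly the "expected behavior" described above: you pick the functional whose output shape you want, rather than building it yourself in a loop.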

Useful Iteration

Creating a few useful functions

On Day 4, as we learned about iteration, we were already using functions, and I told you to ignore why they were functions for now. Let’s go back and pick those apart a little more and make some useful functions for that hobo temperature data. We will make a summary function, a plot function, and a modeling function.

Read in data

Let’s read in that Hobo air temperature data again and take a closer look at the functions and functionals. As a reminder, there will be 4 columns:

  • idx is a sequential index column of integers intended to track the original data order as the data are manipulated.
  • DateTime is a datetime that is going to read in as a character.
  • T_F is temperature in Fahrenheit.
  • Lum is a measure of light intensity in lux (lumens/ft2).

library(ggplot2);library(magrittr)
#get that data
fNames<-c("APIS01_20548905_2021_temp.csv",
          "APIS02_20549198_2021_temp.csv",
          "APIS03_20557246_2021_temp.csv",
          "APIS04_20597702_2021_temp.csv",
          "APIS05_20597703_2021_temp.csv")

fPaths<-paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames) 

HoboList<-lapply(fPaths, FUN=read.csv, skip=1, header=T)%>% #1. read hobo data into a list
          lapply(., "[",,1:4)%>% #2. Grab only the first 4 columns. Empty comma is not an error
          lapply(., setNames, c("idx","DateTime","T_F","Lum"))%>% #3. set col names
          lapply(., dplyr::mutate, DateTime2=as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz="UCT"))%>%#4. format datetime in new variable. 
          setNames(., fNames) #5. name each one for tracking

This was actually our first functional programming exercise, and we did several of them. See the numbering in the comments above:

  1. Take each element of the vector fPaths and pass it to the function (FUN) read.csv one by one, returning a list of data frames. Note that the arguments that come after the function identification are arguments for the specified function.
  2. You may be surprised to know that the indexing notation [] is actually a function and can be called using "[" and passing it any row and column information.
  3. Rename the columns.
  4. Create a new variable that formats the data as datetime so it plots correctly.
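To convince yourself of point 2, here is a tiny standalone example (df is a throwaway data frame) showing that "[" can be called like any ordinary function:

```r
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# the familiar bracket syntax...
a <- df[, 1:2]

# ...is the same function called with a regular function-call syntax;
# the empty argument is the missing row index, just like df[, 1:2]
b <- "["(df, , 1:2)

identical(a, b)  # TRUE
```

This is why `lapply(., "[", , 1:4)` works in the pipeline above: lapply passes each data frame to "[" along with the extra row/column arguments.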

So, that was the application of canned functions in a functional, on to the new functions.

Summary Statistics for the data

The way we did summary statistics before was a little clunky. It was only really appropriate if that was the only place we would be doing it that way. Let’s assume we will do it that way for a number of different datasets in different workflows.

Let’s strategize a little. I have two phases to consider: the function and the iteration.

Function considerations:

  • Hobo data has 3 columns. Is this just for hobo data? It doesn’t have to be. Maybe the column number will change if I decide to analyze light data at some point, so I probably want the target column to be an argument I can change.
  • I want to get the output for a bunch of simple statistics.
  • All will operate on the same data.
  • All have singular outputs.
  • na.rm defaults to FALSE in all of them, and I would like it to be TRUE.
  • I am inputting a list of data, but I want a dataframe (or something easily coerced into one) back.
  • I want to be sure I am properly tracking the names of columns and files.

Iteration considerations:

  • I am inputting a list of dataframes, but I want a single dataframe in return that summarizes all of them.
  • I want to be sure I am properly tracking the names of columns and files.

Now that I have put some thought into it, let’s start coding. A note on the code: it is primarily intended for demonstration purposes and can be done more cleanly in purrr::map.

hobo_summary <- function(x, col=3) {
  funs <- c(mean, median, sd, mad, IQR)  #list of functions
  unlist(                                #unlist simplifies somewhat ugly output
    lapply(funs, function(f){f(x[,col], na.rm = TRUE)})%>%
      setNames(., c("mean", "median", "sd", "mad", "IQR"))
  )
} #credit to [5] for this general example idea

summarized_data<-lapply(HoboList, FUN=hobo_summary)%>%
                        do.call("rbind", . )%>%   
                        as.data.frame( . ) 

Questions:

  • How many arguments does hobo_summary take? How many have a default value?
  • What is funs?
  • What is lapply(funs, function(f){f(x[,col], na.rm = TRUE)}) passing to the anonymous function?
  • What is that anonymous function doing?
  • What is lapply(HoboList, FUN=hobo_summary) passing to hobo_summary?
  • What are the benefits and drawbacks of setting the column names in hobo_summary?
  • What is unlist doing?

Build a plotting function

I find that rapidly generating a pile of plots is often useful for some initial QA/QC. Do my data look about like I expect? This allows me to quickly generate a few or hundreds of plots and flag some data for closer inspection.

Strategy time

Function considerations:

  • I need to make a simple plot. plot would work fine, but I find ggplot makes saving the plot to a variety of formats easier because of ggsave.

  • Hobo data has 3 columns.

  • Maybe some files have some or all data missing. ggplot handles missing data reasonably well

  • Time series data. ggplot can plot that just fine.

  • I want a plot title, but no plot title in the output file name. The title will come from the input file name, so I only know it relative to each list element.

Iteration considerations:

  • In each file the column locations and data contained are the same.

  • Each of the 5 have similar names - could be useful for iteration.

  • I strongly suspect missing data.

  • I want to name the output plots using their source filenames. I can name the elements in the data list so I know which is which, but lapply doesn’t pass that information along. I can reference a separate vector, but that reduces the flexibility of the code. If I am going to use lapply, this means I need to get a little creative.

ggplotCustom<-function(i, j, pattern=".csv", replacement="_plot.pdf", path=choose.dir(),
                       device="pdf", height=5, width=5, units="in"){
  p<-ggplot(data = j[[i]], aes(x=DateTime2, y=T_F))+
     geom_point()+
     ggtitle(names(j)[i])
  ggsave(filename=gsub(pattern=pattern, replacement=replacement, names(j)[i]),
         path=path, plot=p, device=device,
         height=height, width=width, units=units)
}

#do some quick testing
ggplotCustom(i=1, j=HoboList, path= "C:/Users/tparr/Downloads/Training_Output/") #test to see if function is working for a positive case

ggplotCustom(i=3, j=HoboList, path= "C:/Users/tparr/Downloads/Training_Output/") #test to see how it behaves on a negative case

#now iterate
lapply(seq_along(HoboList), FUN=ggplotCustom, HoboList, path="C:/Users/tparr/Downloads/Training_Output/")

Someone’s head just exploded with all that, so let’s pause and unpack it, because I threw a lot in here. Let’s start by talking about some of the logic in the arguments list. function(i, j, pattern=".csv", replacement="_plot.pdf", path=choose.dir(), device="pdf", height=5, width=5,units="in"):

  • What is i?
    In this case, i is an index that runs from 1 to the number of list elements.

  • Where are we using pattern and replacement?

  • Then we defined a bunch of stuff for ggsave. Saving is pretty fiddly and depending on how I am going to use the function I may want to change those save arguments without rewriting the entire function. If I am happy with the defaults, I don’t need to state them each time.

  • choose.dir() - if you don’t specify where the output is going to go, you get prompted to select a directory. My intent is that I will always specify the path, but maybe I will forget; on the other hand, leaving it unset means I would be prompted on each step of the iteration.

Now let’s look at how this was fed into lapply: lapply(seq_along(HoboList), FUN=ggplotCustom, HoboList, path="C:/Users/tparr/Downloads/Training_Output/"):

  • seq_along basically creates a vector from 1 to the number of list elements.

  • Where is i? The way lapply works is that it takes the first element of the list and passes it to the first argument of the specified function. In the call above, the first element produced by seq_along is 1 and the first argument of ggplotCustom is i, so it effectively says i=1. If we had not used seq_along, lapply would instead pass the first element of HoboList, which is a dataframe. That is not an undesirable behavior, but in this case we want to be able to reference back to where that dataframe is so we can extract its name.
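Here is a small toy version of this pattern (x, i, and j are stand-ins, not the training data): iterating over seq_along() indices keeps both the element and its name reachable, which lapply alone would drop:

```r
x <- list(APIS01 = 1:3, APIS02 = 4:6)

seq_along(x)  # 1 2

# i receives the index from seq_along(); the whole list is passed as j,
# the same way HoboList is passed to ggplotCustom above
res <- lapply(seq_along(x), function(i, j){
  paste(names(j)[i], "has mean", mean(j[[i]]))
}, x)

res[[1]]  # "APIS01 has mean 2"
```

Because the function sees the index rather than the bare element, it can look up names(j)[i] for titles and file names, exactly as ggplotCustom does.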

Iterating an Analysis

As you advance in R or have more complex datasets, you will want to iterate your analyses. We will work through an example of iterating a linear model analysis for the hobo data. The question will be simple: how does light level affect air temperature?

lm_ls<-function(data,x){mod<-lm(x, data=data); return(mod)}
modlist<-lapply(HoboList[c(2,4,5)], lm_ls, T_F~Lum) #I know some are missing light data

We can now extract some stuff and do some diagnostics. A lot of common tools can be used in a functional.

lapply(modlist, summary)
lapply(modlist, plot)
lapply(modlist, coef)

That was nice, but it wasn’t in the best format for further use and it wasn’t what we really needed. Let’s drill down to an output we like.

lm_stats<-function(mod){
  mod_sum<-summary(mod) #may or may not be worthwhile
  out<-data.frame(
    intercept = coef(mod)[[1]],
    slope     = coef(mod)[[2]],
    slp_pval  = mod_sum$coefficients[,4][[2]], #see what happens if you run this without the [[2]]
    R2_adj    = mod_sum$adj.r.squared,
    mod_pval  = mod_sum$fstatistic %>% {unname(pf(.[1],.[2],.[3],lower.tail=F))})
  return(out)
}

m<-lapply(modlist, lm_stats)%>%
         do.call(rbind,.)%>%
         dplyr::mutate(.,id=rownames(.))%>%
         magrittr::set_rownames(.,1:nrow(.))

You know the drill, what are your questions?

  • What makes sense?
  • What doesn’t make sense?
  • Why am I using the “::” (if we haven’t addressed that already)?
  • Did everyone catch what “;” is doing?

Parallel Processing

Parallel Processing

We are dealing with small problems, and small problems seldom push the limits of your computer. Large datasets and complex functions, though, can take a long time to process (even after you fully optimize them). In R, this is primarily a function of your processor speed: by default, R runs on a single processing core, so for something like lapply (or map, or foreach) it processes each iteration sequentially on that one core. It doesn’t need to be that way. Most computers have more than 2 cores, so you could be executing different independent iteration steps on separate cores and recombining the results. This is called ‘parallel processing’.

There are versions of this out there for lapply, but they never seem to work quite right. The good news is that there is a relatively new unified framework that can be used for any coding style, so base, tidy, and foreach approaches can all be easily parallelized using the functions in the future.apply package.

So let’s explore and time a parallelization of lapply. Your code may vary if not on Windows. This will take 1-2 minutes to run depending on your computer.

library(future.apply) #provides future_lapply(); also loads future, which provides plan()

HoboList2<-c(rep(HoboList,5)) #make the dataset larger 

plan("multisession", workers=parallel::detectCores()-1) #initiate a multicore session; using 1 fewer core than the max detected reduces the chance of overwhelming the system
microbenchmark::microbenchmark(
  "sequential"={lapply(seq_along(HoboList2), FUN=ggplotCustom, HoboList2, path="C:/Users/tparr/Downloads/Training_Output/")},
  "parallel"={future_lapply(seq_along(HoboList2), FUN=ggplotCustom, HoboList2, path="C:/Users/tparr/Downloads/Training_Output/")},
  times=5,
  unit="s"
)
plan("sequential") #close the multicore session

My run says that parallelization was 23% faster than sequential. Not a huge speed improvement, but something to keep in mind to try if a chunk of code is taking ~30 minutes to execute. Now I kind of want to go and see if this approach can speed up file.copy.

One important thing to remember is that initiating a parallel session can slow down your computer significantly if not done properly. It's best to test with small data, then scale up.

Iteration with purrr

Custom functions are the most useful when you have a complicated task that you wish to do many times.

Step 1: Write and test the function.

Step 2: Iterate the function

This section focuses on the many options available for Step 2

We will need the tidyverse, lubridate, and modelr libraries.

library(tidyverse)
library(lubridate)
library(modelr)
for() Loops

for() loops run the same code a certain number of times; each pass through the loop executes the code. You need an index that tells the loop how many times to run, and this is specified in the for() statement.

## First statement makes i the "index" variable; the body will run once for each i
x <- NA
for (i in 1:100) {

  ## get 10000 random numbers from a Poisson distribution with mean i (changes each pass)
  Samples <- rpois(10000, i)

  ## get the 95% upper quantile - i.e. the number that is higher than 95% of the other samples
  Q95 <- quantile(Samples, 0.95)

  ## save that to vector "x", using [i] to make sure it goes to the right location
  x[i] <- Q95
}

# plot our results

plot(x)

for() loops are very common in other programming languages.

In R they are less popular:

  1. They can be very slow. One error (shown in the example above) is to “grow the object organically”. Before I started the loop I assigned the value NA to x, so x had just one entry. As the loop went along, it first overwrote the NA and then, on every pass after the first, it added a new value. Growing x this way requires a lot of rewriting of x, which is slow. For this example you don’t notice it, but if x were a large dataframe or grew to be millions of values, you would. This can be avoided by telling R how big the object will be right from the start: before the loop I could write x<-vector(mode="double", 100) to create a numeric vector that is 100 entries long. This will speed up the loop.

  2. Loops, especially nested loops can be hard to read and debug if you come back to the code after a long time.

  3. R has other options that are faster and easier to read.
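To illustrate point 1, here is a quick sketch (with arbitrary index values) comparing the growing version to a preallocated one:

```r
set.seed(42)  # reproducible random draws

# Slow pattern: x grows by one element on every pass, forcing repeated copies
x_grow <- NA
for (i in 1:100) {
  x_grow[i] <- quantile(rpois(10000, i), 0.95)
}

# Fast pattern: allocate all 100 slots before the loop starts
x_pre <- vector(mode = "double", length = 100)
for (i in 1:100) {
  x_pre[i] <- quantile(rpois(10000, i), 0.95)
}

length(x_pre)  # 100 entries, allocated once up front
```

Both loops produce the same kind of result; only the allocation strategy differs.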

rm(i, Q95, x, Samples)


Iteration using the purrr package - data cleaning

General purrr references:

Cheat sheet

Iteration Chapter in R for Data Science

purrr is a package that is part of the tidyverse that focuses on iterating over lists, such as lists of data.frames or model results. Combining custom functions and this package can be extremely useful in quickly doing repetitive data manipulation, analysis and visualization. If your task basically consists of starting with a lot of datasets (or one dataset that you then divide by site, species, etc.) and then doing the same thing to all the datasets, purrr is a good option to consider.

If you are starting out with one data.frame and need to turn it into a list of data.frames based on a column, you can use data %>% split(f = as.factor(data$column)). The dplyr pipeline data %>% group_by(column) %>% group_split() can do the same thing, but if you want the list elements named you will have to do that afterwards by hand using the names() function, like: names(MyList) <- c("NameOne", "NameTwo"), which can be annoying.
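A quick sketch of the split() approach on a made-up data.frame (the site codes and values are invented):

```r
# One data.frame with a grouping column
data <- data.frame(
  site = c("A", "A", "B", "B", "B"),
  temp = c(10, 12, 9, 11, 13)
)

# split() returns a named list of data.frames, one per level of site
data_list <- split(data, f = as.factor(data$site))

names(data_list)        # "A" "B" -- the names come along automatically
nrow(data_list[["B"]])  # the "B" element holds the three B rows
```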

We will walk through an example analysis using purrr functions to do iteration.

#file names
fNames <- c(
  "APIS01_20548905_2021_temp.csv",
  "APIS02_20549198_2021_temp.csv",
  "APIS03_20557246_2021_temp.csv",
  "APIS04_20597702_2021_temp.csv",
  "APIS05_20597703_2021_temp.csv"
)

# Make path to files
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)

First, I will read in all the data. My plan is to make a list of data.frames and then carry out all the analysis on that list. set_names() is used to name the fPaths vector, and those names will be carried forward to the data list. The map() function will take each element of fPaths and feed them one by one into read_csv(). The advantage of doing iteration this way is that it is more concise than a for() loop, but somewhat easier to read than the lapply() function. map() takes in a list or a vector. You then specify a function to use with a ~ in front of it. The .x tells map() where to put the data it gets from fPaths.

fPaths<- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05"))
fPaths  # This thing is now a vector where each element has a name
##                                                                                                         APIS01 
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS01_20548905_2021_temp.csv" 
##                                                                                                         APIS02 
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS02_20549198_2021_temp.csv" 
##                                                                                                         APIS03 
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS03_20557246_2021_temp.csv" 
##                                                                                                         APIS04 
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS04_20597702_2021_temp.csv" 
##                                                                                                         APIS05 
## "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS05_20597703_2021_temp.csv"
## import the site data into a list where each element comes from a different file
intensity_data_raw <- fPaths %>% map( ~ read_csv(file = .x,  skip = 1))

In this case we have a list of datasets.

Now we can examine this data:

class(intensity_data_raw) # its a list
## [1] "list"
length(intensity_data_raw) # 5 elements
## [1] 5
class(intensity_data_raw[[1]]) # each element is a data.frame (well, tibble actually)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
intensity_data_raw %>% map_dbl(~ nrow(.x))
## APIS01 APIS02 APIS03 APIS04 APIS05 
##   4554   6287      0   6286   6287
intensity_data_raw %>% map(~ colnames(.x))
## $APIS01
## [1] "#"                                                           
## [2] "Date Time, GMT-05:00"                                        
## [3] "Temp, °F (LGR S/N: 20548905, SEN S/N: 20548905, LBL: APIS01)"
## [4] "Bad Battery (LGR S/N: 20548905)"                             
## [5] "Coupler Attached (LGR S/N: 20548905)"                        
## [6] "Host Connected (LGR S/N: 20548905)"                          
## [7] "Stopped (LGR S/N: 20548905)"                                 
## [8] "End Of File (LGR S/N: 20548905)"                             
## 
## $APIS02
##  [1] "#"                                                        
##  [2] "Date Time, GMT-05:00"                                     
##  [3] "Temp, °F (LGR S/N: 20549198, SEN S/N: 20549198)"          
##  [4] "Intensity, lum/ft² (LGR S/N: 20549198, SEN S/N: 20549198)"
##  [5] "Bad Battery (LGR S/N: 20549198)"                          
##  [6] "Good Battery (LGR S/N: 20549198)"                         
##  [7] "Coupler Attached (LGR S/N: 20549198)"                     
##  [8] "Host Connected (LGR S/N: 20549198)"                       
##  [9] "Stopped (LGR S/N: 20549198)"                              
## [10] "End Of File (LGR S/N: 20549198)"                          
## 
## $APIS03
## [1] "#"                                                        
## [2] "Date Time, GMT-05:00"                                     
## [3] "Temp, °F (LGR S/N: 20557246, SEN S/N: 20557246)"          
## [4] "Intensity, lum/ft² (LGR S/N: 20557246, SEN S/N: 20557246)"
## 
## $APIS04
##  [1] "#"                                                        
##  [2] "Date Time, GMT-05:00"                                     
##  [3] "Temp, °F (LGR S/N: 20597702, SEN S/N: 20597702)"          
##  [4] "Intensity, lum/ft² (LGR S/N: 20597702, SEN S/N: 20597702)"
##  [5] "Bad Battery (LGR S/N: 20597702)"                          
##  [6] "Good Battery (LGR S/N: 20597702)"                         
##  [7] "Coupler Attached (LGR S/N: 20597702)"                     
##  [8] "Host Connected (LGR S/N: 20597702)"                       
##  [9] "Stopped (LGR S/N: 20597702)"                              
## [10] "End Of File (LGR S/N: 20597702)"                          
## 
## $APIS05
##  [1] "#"                                                        
##  [2] "Date Time, GMT-05:00"                                     
##  [3] "Temp, °F (LGR S/N: 20597703, SEN S/N: 20597703)"          
##  [4] "Intensity, lum/ft² (LGR S/N: 20597703, SEN S/N: 20597703)"
##  [5] "Bad Battery (LGR S/N: 20597703)"                          
##  [6] "Good Battery (LGR S/N: 20597703)"                         
##  [7] "Coupler Attached (LGR S/N: 20597703)"                     
##  [8] "Host Connected (LGR S/N: 20597703)"                       
##  [9] "Stopped (LGR S/N: 20597703)"                              
## [10] "End Of File (LGR S/N: 20597703)"

I used map_dbl() instead of map() to look at the number of rows. That function is just like map() except it will return a numeric vector rather than a list. In this case it is handy as I already know that nrow() will return a number, and a vector is a more compact way of showing that. Note that all the lists and the elements of the row number vector are named with the site code.

Other options for map() are:
  • map_lgl() for a logical vector
  • map_chr() for a character vector
  • map_int() for an integer vector
  • map_dfr() for making a data.frame by binding rows
  • map_dfc() for making a data.frame by binding columns
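A small sketch of these typed variants on a toy list (the values are made up):

```r
library(purrr)

# A toy named list of integer vectors
x <- list(a = 1:3, b = 4:6)

map(x, ~ mean(.x))         # plain map(): returns a list
map_dbl(x, ~ mean(.x))     # named numeric vector: a = 2, b = 5
map_int(x, ~ length(.x))   # named integer vector: a = 3, b = 3
map_chr(x, ~ class(.x))    # named character vector: "integer" "integer"
map_lgl(x, ~ any(.x > 4))  # named logical vector: FALSE TRUE
```

Each typed variant errors if the function does not return a length-one result of the right type, which is a useful safety check.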

When we look at the raw files from the loggers we see they are a Horror!

  1. One file has no data.

  2. One file is missing the luminosity data.

  3. There are a bunch of columns we don’t want.

  4. The serial number is part of many column names, so the columns change with each file.

  5. The column names have lots of spaces, punctuation, a superscript, and even a degree symbol.

I do NOT want to have to fix this one file at a time - particularly if there is a large number of files. I want to get the Date, Temp and Intensity data and not the rest of the columns.

First I will use discard() from the purrr package to just drop the site with no data. (Note keep() is the opposite of discard()).

Then, I will make a function that can take the raw data and correct these issues. Note the use of starts_with() to select the columns whose names start the same but have different serial numbers at the end. This is from a small package called tidyselect. Many of the tidyverse packages use it as a back-end and export these functions. Other similar helpers include ends_with(), any_of(), all_of(), contains(), and where(). It’s not a very big package, so it’s worth checking to see what it has.
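As a sketch, here is starts_with() on a toy tibble whose column names mimic the raw logger files (the names and values are invented):

```r
library(dplyr)

# Toy data with logger-style column names
df <- tibble::tibble(
  `Date Time, GMT-05:00`         = "01/01/21 00:00:00",
  `Temp, °F (LGR S/N: 20548905)` = 45.1,
  `Stopped (LGR S/N: 20548905)`  = NA
)

# starts_with() matches the stable prefix and ignores the changing serial number
df %>%
  select(starts_with("Date") | starts_with("Temp")) %>%
  colnames()
```

Because the selection keys on the prefix, the same code works for every file regardless of serial number.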

I also add a Temp_C column by transforming the temperature data, and use functions from lubridate to make the Date_Time column that data type and extract a Date component.

Then I will use map() to get the data into a consistent format:

library(lubridate)

intensity_data_raw <- intensity_data_raw %>% discard(~ nrow(.x) == 0)

Fix_Data <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}


intensity_data <- intensity_data_raw %>% map(~ Fix_Data(.x))

Now all the data is in the same format.

The data is measured on an hourly basis - it might be useful to make a summary by day. I will use group_by() and across() to summarize each dataset. The across() function lets me select many columns and perform a function on them without having to know their names. It also works inside functions like mutate() and filter().

The where() tells across() which columns to work with, and the mean() indicates what will happen to the selected columns. In this case it says that any numeric column should be summarized by its daily mean. Because I don’t have to name each column, I don’t need special code to account for the missing Intensity column in one dataset.

Summarise_Data <- function(data) {
  data %>%
    group_by(Date) %>%
    summarise(across(where(is.numeric), mean)) 
}

summary_data <- intensity_data %>% map(~ Summarise_Data(.x))

At the end of the datasets are some days where Intensity is always 0. I am going to treat that as an error and get rid of those rows. Again, across() is the way to accomplish this. The first part of the function indicates which columns to use: the any_of() tells the filter to only work on the Intensity column, if there is one. The ~ .x != 0 says that 0 should be filtered out of any column that meets the requirements (i.e. any column called “Intensity”).

### Oops - some of the intensity data is 0, fix this for later use in gamma glm().

Zero_Intensity_Fix <- function(data) {
  data %>%
    filter(across(any_of("Intensity"), ~ .x != 0))
}

summary_data <- summary_data %>% map(~ Zero_Intensity_Fix(.x))
purrr- Graphing 1

Now we will write a function to make a ggplot graph and use map() to graph all the data. The where() from the tidyselect package is helpful again.

Summary_Graph <- function(data) {
  data <- data %>%
    pivot_longer(cols = where(is.numeric), names_to = "Measure", values_to = "Value") %>%
    filter(Measure != "Temp_F")

  Graph <- ggplot(data, aes(x = Date, y = Value, group = Measure, color = Measure)) +
    geom_point() +
    facet_grid(Measure ~ ., scales = "free_y")

  return(Graph)
}

Graphs1 <- summary_data %>% map(~ Summary_Graph(.x))

Note that in the Summary_Graph() function I first created the dataset I wanted using pivot_longer() and saved it to data, and then made the graph. Because I was making two things in this function, I assigned the graph to Graph and then used the return() function to indicate what the output should be. Because I made the graph last, it would be the return value by default, but as your functions get more complicated it can be good to specify that explicitly. Also, I could have written return(list(data, Graph)) and returned both objects.
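A minimal sketch of that pattern, using simple stand-ins rather than the real data and ggplot objects:

```r
# Sketch: build two things inside one function and return both as a named list
Summarise_Both <- function(data) {
  long_data <- data * 2          # stand-in for the pivot_longer() step
  result    <- sum(long_data)    # stand-in for the graph-building step
  return(list(data = long_data, result = result))
}

out <- Summarise_Both(1:3)
out$data    # each piece is accessed by name
out$result
```

Returning a named list makes it obvious at the call site which piece is which.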

purrr- Doing an analysis and graphing results

Now we will use general linear regression to examine the relationship between light and temperature. There are a couple of challenges here:

  1. One of the data sets does not have the Intensity column, so we need to skip it. We could just cut it from the list, but let’s assume that you will be getting more datasets in the future and want the function to handle that kind of error on its own. To accomplish this, the map_if() function is used. This function is similar to a normal if() statement, but is used in the context of iterating over a list.

  2. Regression analysis can get kind of complicated and there are lots of potential arguments to your regression function. We want to be able to experiment with different ways to fit the data without having to specify every argument in advance or constantly rewrite the function. To do this, the ellipsis (...) is used.

Intensity_Regression <- function(data, ...) {
  glm(Intensity ~ Temp_C, data = data, ...)
}


Gauss_Reg <- summary_data %>%
  map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
    ~ Intensity_Regression(.x, family = gaussian),
    .else = ~ NA
  )

Here is what is going on:

map_if() first checks that Intensity and Temp_C are both present in the column names of each data.frame. If they are, it runs the Intensity_Regression() function. If they are not present, it just returns NA.

Intensity_Regression() takes in the data, and the ... indicates that there may be some other arguments as well. These other arguments are not specified in advance. The ... then appears again in the glm() call, which indicates that whatever these extra arguments are, they should be passed to glm(). You can only send the ... to one place inside your function body; you can’t send it to multiple functions. If for some reason you want to store whatever is in the ... inside your function, you can use match.call(expand.dots = FALSE)$`...`.
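Here is a minimal base-R sketch of the ... mechanism (the function name My_Mean() is invented for illustration):

```r
# Any arguments beyond x are collected by ... and forwarded, unchanged, to mean()
My_Mean <- function(x, ...) {
  mean(x, ...)
}

My_Mean(c(1, 2, 3, 100))               # no extra arguments: plain mean, 26.5
My_Mean(c(1, 2, 3, 100), trim = 0.25)  # trim is passed through to mean(): 2.5
My_Mean(c(1, 2, NA), na.rm = TRUE)     # so is na.rm: 1.5
```

The wrapper never needs to know which of mean()'s arguments the caller will use, which is exactly how Intensity_Regression() forwards family to glm().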

When I made the Gauss_Reg output I indicated I wanted a Gaussian (normal) distribution for my regression. So, the family argument was passed to glm(). Now I will do it again with a Gamma distribution.

Gamma_Reg <- summary_data %>%
  map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
    ~ Intensity_Regression(.x, family = Gamma(link="log")),
    .else = ~NA
  )

Let’s make a graph of the data vs. fit for each result of the Gaussian regression.

First, we will add the predictions of the model to each dataset using add_predictions() from the modelr package.

library(modelr)

summary_data <- map2(.x = summary_data, .y = Gauss_Reg, .f = ~ {
  if ("Intensity" %in% colnames(.x)) {
    add_predictions(data = .x, model = .y, var = "Pred")
  } else {
    .x
  }
})

The map2() function was used. What this does is take two lists, .x and .y, and feed them into a function in parallel. So what happens here is that the function takes the first site from summary_data and the first regression from Gauss_Reg, sees that there is no Intensity data available, and just returns the original data.frame. Then it takes the second element from each list, sees that there is Intensity data, and uses add_predictions() to add a new column to the data and return it. This repeats down the lists. If you have 3+ lists to iterate over, use the pmap() function instead.
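A toy sketch of map2() and pmap() iterating in parallel (the names and values are invented):

```r
library(purrr)

# Two parallel lists that share the same element names
means <- list(APIS01 = 10, APIS02 = 20)
sds   <- list(APIS01 = 1,  APIS02 = 2)

# .x is drawn from the first list and .y from the second, element by element
upper <- map2(means, sds, ~ .x + 2 * .y)
upper$APIS01  # 12

# pmap() generalizes the same idea to any number of lists
pmap(list(m = means, s = sds), function(m, s) m + 2 * s)
```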

Now that we have the predictions in the data.frames, let’s make a graph to compare the model predictions with the data.

Pred_Graph_Data<- bind_rows(summary_data, .id="Site") %>% filter(!is.na(Pred))

ggplot(Pred_Graph_Data, aes(x=Date, y=Intensity))+
  geom_point()+
  geom_line(aes(x=Date, y=Pred, color="red"))+
  facet_grid(Site~., scales="free_y")

We can feed summary_data to bind_rows() and it will turn the list of data.frames into one combined data.frame. The .id argument tells bind_rows() what to name the column that tells you the site name, and the site name is already present as the name of the input list.

Tips and Tricks

Commenting

Adding comments to your code can help future users, including future you, understand what the code does and why you wrote it that way. However, comments are extra work, and too many comments can clutter your code with unhelpful information. Also, as you update and improve your code, you should update the relevant comments as well; the more comments you have, the easier it is for your comments and code to get out of sync.

Some things to consider:

  1. Try to give functions, and the objects created inside them, names that indicate what they do or are.

  2. Don’t explain every line. If your code is that hard to follow, then there is a problem with your code.

  3. If you use functions from packages (e.g. tidyverse functions), you don’t need to repeat basic info from the help.

  4. If your function has different sections that do different things - e.g. prepare data and then use data - it can be helpful to use comments to distinguish each section and explain what it does.

  5. If you need to do something unusual to account for missing data, special cases, bugs in other people’s code - add comments to explain what you are doing.

  6. If there was some obvious way to do something but you could not use it for some reason, indicate why.

  7. If you are using what seems to you to be an obscure package - note why you are using it.

  8. If you have a closing curly brace or parenthesis that is many lines from where the opening brace / parenthesis is, it can be helpful to add a comment indicating what it is closing.

Performance - Introduction

“Performance” here refers to how fast your code does something.

Faster is generally better, but it often requires more time to write and can include code that is harder to read.

Most of the time you can ignore performance. If you run the code and are happy with how long it takes, then there is no problem.

Performance can be a consideration when:

  1. You are doing the same analysis many (dozens to hundreds of) times. Think of an analysis of each species at many parks, or each water quality measure at many sites.

  2. You have very large data files that you need to manipulate.

  3. You are doing a simulation study and need to create and manipulate large numbers of data sets.

  4. You are doing some sort of interactive visualization and need results very quickly.


Vectorization

R has “vectorization” - this means there are some functions that take a vector as an input, perform a calculation or operation on each element and return a vector as output. Some examples:

Numeric Vector

# A vector of numbers
x <- 1:10

# Arithmetic is vectorized
x + 1
##  [1]  2  3  4  5  6  7  8  9 10 11

Character Vector

# a character vector (letters)

x <- letters[1:10]

# paste is vectorized
paste("Letter:", x)
##  [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d" "Letter: e" "Letter: f"
##  [7] "Letter: g" "Letter: h" "Letter: i" "Letter: j"

Logical Vector

# a logical vector

x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# logical operations are vectorized

TRUE & x
## [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

Date Vector

# a date vector

library(lubridate)

x <- mdy("02/07/2022", "02/08/2022", "02/09/2022", "02/10/2022")
x
## [1] "2022-02-07" "2022-02-08" "2022-02-09" "2022-02-10"
## adding 1 shows you the next day
x + 1
## [1] "2022-02-08" "2022-02-09" "2022-02-10" "2022-02-11"
rm(x)

Most functions in R are vectorized. This includes:

  1. Mathematical functions, including statistical functions.

  2. Logical operators.

  3. Functions such as rowMeans() that work on matrices.

  4. Text manipulation functions.

Many of these functions are written in C which makes them extremely fast.

You SHOULD NOT write a function or a for() loop to do iteration when the pre-existing vectorization will do the same thing.

You SHOULD make use of vectorization in your functions to keep them fast and readable.

When you use the dplyr function mutate() you are making use of vectorization.
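For example, the Fahrenheit-to-Celsius conversion used earlier is vectorized: one expression converts the whole column, no loop required. A quick sketch comparing it to the loop equivalent:

```r
Temp_F <- c(32, 50, 68, 86)

# Vectorized: one expression converts every element at once
Temp_C <- 5 * (Temp_F - 32) / 9
Temp_C  # 0 10 20 30

# Loop version: same result, more code, and slower on long vectors
Temp_C_loop <- vector(mode = "double", length = length(Temp_F))
for (i in seq_along(Temp_F)) {
  Temp_C_loop[i] <- 5 * (Temp_F[i] - 32) / 9
}

identical(Temp_C, Temp_C_loop)  # TRUE
```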

Profiling

“Profiling” is looking at each part of a function to find out how long it takes. This can allow you to identify bottlenecks where speeding up your code might be helpful. Then tools such as microbenchmark can be used to compare options. RStudio has profiling tools built in, but to use them you need to install the profvis package.

Once you do this you can access the Profile menu in the top menu bar. You can select some code, and use “Profile Selected Line(s)” to run the profiler.

(Screenshot: the RStudio Profile menu)

This will give you a “Flame Graph” which tells you how much time each part of your function was running. Here we will profile the Fix_Data() function from earlier.

(Screenshot: the profvis flame graph in RStudio)

The graph shows time across the bottom. Looking from the bottom to the top you see how long various parts took. You can see that profvis (the profile tool) in the bottom row was running the whole time. One row up is mdy_hms() from lubridate which takes up most of the time. You don’t see functions like select() or rename() as they ran too quickly. So if you wanted to speed up Fix_Data() the place to look for improvements is in the date conversion.
Benchmarking

Let’s say that you wanted to speed up the Fix_Data() function. Benchmarking is a tool you can use to compare different versions of a function to see which one runs faster. To do this we will use the microbenchmark package. Earlier, mdy_hms() from lubridate was used to convert the Date_Time column from character to a date-time. Let’s see whether the base functions as.POSIXct() and strptime() are quicker.

library(tidyverse)
library(lubridate)
library(microbenchmark)

# Recreate intensity data
fNames <- c(
  "APIS01_20548905_2021_temp.csv",
  "APIS02_20549198_2021_temp.csv",
  "APIS03_20557246_2021_temp.csv",
  "APIS04_20597702_2021_temp.csv",
  "APIS05_20597703_2021_temp.csv"
)

fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)

intensity_data_raw <- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05")) %>%
  map(~ read_csv(file = .x, skip = 1)) %>%
  discard(~ nrow(.x) == 0)


# Original function
Fix_Data_Lubridate <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}


# New version using as.POSIXct()
Fix_Data_POSIXct <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(
      Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = as.POSIXct(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UTC"),
      Date = date(Date_Time)
    )
}

# new version using strptime
Fix_Data_strptime <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(
      Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = strptime(Date_Time, "%m/%d/%y %H:%M:%S", tz = "UTC"),
      Date = date(Date_Time)
    )
}

# Do Comparison
mb<-microbenchmark(
   Lubridate = intensity_data_raw %>% map(~ Fix_Data_Lubridate(.x)),
   POSIXct = intensity_data_raw %>% map(~ Fix_Data_POSIXct(.x)),
   Strptime = intensity_data_raw %>% map(~ Fix_Data_strptime(.x)),
  times = 10L, unit = "ms"
)

mb
## Unit: milliseconds
##       expr      min       lq      mean    median       uq      max neval
##  Lubridate  88.6787  89.6703  94.23664  90.27815 101.4092 107.2358    10
##    POSIXct 298.9119 299.9104 304.35697 302.43415 307.7321 315.7129    10
##   Strptime 162.4105 163.8421 165.89392 164.90895 167.9155 171.5239    10
boxplot(mb)

As it happens, the lubridate version came out the fastest. You should also consider whether this data is representative of the work you need to do. It is not unusual for one way of coding to be fastest with large amounts of data, but to have no advantage for small amounts of data.

Day 3: R Markdown

Intro to R Markdown

Background

R Markdown is a special file format that you can open in RStudio (go to File > New File, and “R Markdown…” is the 3rd option). R Markdown takes what you put into the .Rmd file, knits the pieces together, and renders it into the format you specified using Pandoc, which is typically installed as part of the RStudio IDE bundle. The knit and render steps generally occur at the same time, and the terms are often used interchangeably. For example, the Knit button knits and renders the .Rmd to your output file. The knit keyboard shortcut, Ctrl + Shift + K, is also super handy.

The most commonly used outputs are HTML, PDF, and Word. Each output format has pros and cons, and drives which text editing language (e.g., HTML or LaTeX) you need to use for advanced/custom needs. We’ll cover these in more detail in a minute.

Reasons to use R Markdown

We’re all busy and it’s hard to keep up on everything we need to know to do our jobs well. We wouldn’t be promoting R Markdown like we are if we didn’t feel that it was a huge contributor to our efficiency and productivity. R Markdown has been described as the Swiss Army knife of Data Science, because there are so many things you can do with it. Below are just a few of the many reasons we use R Markdown.
  • My favorite part of R Markdown is that it’s one-stop shopping to store code, text/notes, results, etc., as I’m working on an analysis and writing a report. It saves me the hassle of copying and pasting figures and tables from an analysis into Word or PowerPoint, and allows me to write notes as I’m analyzing. These notes often end up going right into the Methods and Results sections of my papers.
  • They’re reproducible and easily updated. If the data changes, or you realize there’s an error in your analysis, all you have to do is fix the issue, then rerender the .Rmd to update the output file.
  • You can use parameters (e.g., park code, year, species) to iterate through and generate multiple separate reports for each unique parameter. We’ll show examples of that later.
  • The learning curve is also very shallow at the outset. In about 5 minutes of playing with it, you’ll get the basics of how it works. Then, as you learn more HTML/CSS or LaTeX, the sky is the limit. This website, in fact, is built from R Markdown and hosted as a GitPage for the repo that hosts the code.

Getting Started

To get started, you’ll need to make sure you have the rmarkdown package installed. The knitr package, which does a lot of heavy lifting, is a dependency of rmarkdown, so both will be installed with the line of code below.

install.packages("rmarkdown") 

Once you have rmarkdown installed, you should be able to go to File > New File > R Markdown…, and start a new .Rmd file. After selecting “R Markdown…”, you will be taken to another window where you can add a title and author and choose the output format. For now, let’s just use the default settings and output to HTML. You should now see an Untitled .Rmd file in your R session with some information already in the YAML and example plain text and code chunks. You can also start with a blank .Rmd, which will be more convenient once you get the hang of it.




Anatomy of an .Rmd

YAML

The .Rmd file itself consists of 3 main pieces. There’s the YAML (YAML Ain’t Markup Language) code at the top, which is contained within ---, like the image below. The top YAML is typically where you define features that apply to the whole document, like the output format, authors, parameters (more on that later), whether to add a table of contents, etc. The YAML below is what we’re using for this website. Note that indenting is very important in YAML. The css: custom_styles.css tells Markdown that I want the styles defined in my css, rather than the default styling in Markdown. This is optional, and is here just to show a variation on the default YAML you get when starting a new .Rmd. If you don’t want to use your own custom style sheet, then your YAML would just have the following in one line: output: html_document.

  • other common additions in YAML: headers/footers, table of contents, and parameters


Plain text

This is what it sounds like: it’s just text. You can write anything you want outside of a code chunk, and it will render as if you’re writing in a word processor, rather than as code. Note, though, that special characters like % and & may need to be escaped with a \ before the symbol, particularly if you’re using LaTeX (more on that later). You can format text using Markdown’s built-in functions, like those shown below. For a more detailed list of these formatting functions, check out the R Markdown Cheatsheet. You can also code HTML directly in R Markdown, which I actually find easier the more I get comfortable with HTML. The section below shows how to use the same common styles with R Markdown and HTML and what the output looks like.

The actual text in the .Rmd:

# First-level header

## Second-level header

...

###### Sixth-level header

*italic* or _italic_

**bold** or __bold__

superscript^2^

endash: --

Example sentence: *Picea rubens* is the dominant species in **Acadia National Park**.

The HTML version:

<h1>First-level header</h1>

<h2>Second-level header</h2>

...

<h6>Sixth-level header</h6>

<i>italic</i>

<b>bold</b>

superscript<sup>2</sup>

endash: &ndash;

Example sentence: <i>Picea rubens</i> is the dominant species in <b>Acadia National Park</b>.

The text renders as:

First-level header

Second-level header

Sixth-level header

italic or italic

bold or bold

superscript2

endash: –

Example sentence: Picea rubens is the dominant species in Acadia National Park.



Challenge: Specifying other headers

How would you specify header level 4?
Answer
#### This is how to specify header level 4, which renders as:

Header level 4

Code chunks

Code chunks are also what they sound like. They’re chunks of R code (can be of other coding languages too), which run like they’re in an R script. They’re contained within back ticks and curly brackets, like below.

```{r}

```


Code Chunk Options

You can customize the behavior and output of a code chunk using options within the { }. Common chunk options are below:
  • echo = TRUE prints the code chunk to the output. FALSE omits the code from output.
  • results = 'hide' omits the results of a code chunk from the output; 'markup' is the default.
  • include = FALSE executes the code, but omits the code and results from the output.
  • eval = FALSE does not execute the code chunk, but can print the code, if echo = TRUE.
  • error = TRUE allows a code chunk to fail and for the document to continue to knit.
  • cache = TRUE allows you to cache the output associated with that code chunk, and will only rerun that chunk if the code inside the chunk changes. Note that if the objects or data in the code chunk are changed, but the code within the chunk is still the same, the code chunk won’t realize that it needs to rerun. You need to be careful about using the cache option.
  • fig.cap = "Caption text" allows you to add a figure caption.
  • fig.width = 4; fig.height = 3 allows you to set the figure size in inches.
  • out.width = "50%" allows you to set the figure or table size as a percentage of the container/page size.
  • message = FALSE, warning = FALSE prevent messages or warnings from chatty packages from being included in the output.


See the R Markdown Cheatsheet for a complete list of code chunk options.
Another trick with code chunk options is that they can be conditional on the results of another code chunk. For example, I have a QC report that runs 40+ checks on a week's worth of forest data, but the report only includes checks that returned at least 1 value/error. Checks that return nothing are omitted from the report using conditional eval. I'll show an example of that later.

Inline Code Chunks

You can also reference objects in your R session within the plain text using back ticks (i.e., `). In the example below, I’ll first calculate a mean value that I’ll pretend is the average number of species on a plot. Then I’ll use inline code chunks to print that number in the plain text.

First, we calculate numspp with include = F. I'm using the runif() function to randomly generate 50 numbers between 0 and 100 from the uniform distribution, just to have some data to summarize. The include = F option runs the code chunk, but doesn't show the code or the results in the markdown document.

```{r, include = F}
numspp <- round(mean(runif(50, 0, 100)), 1)
```


To use this in a sentence, the plain text looks like this:

The average number of species found in each plot was `r numspp`.

And renders like this:

The average number of species found in each plot was 50.4.

Challenge: Code chunk options

  1. How would you run a code chunk, but only show the results, not the code?
  2. How would you run a code chunk, but then only include the results in the plain text?
Answer

Answer 1. Include echo = F in the code chunk options.
Answer 2. Include include = F in the code chunk options, and then print the result with inline code in the plain text, like `r numspp`.


Output types

For all of the output types, the built-in markdown functions, like ‘#’ for level 1 header, render as you expect. Most output types also have an additional code base that allows more advanced and customized features.

Output to HTML

The code base for HTML is obviously HTML and cascading style sheets (CSS). HTML is used to structure the content. CSS is used to define the style (e.g., font type and size for each header level).

Pros of HTML output

  1. Has the fewest dependencies and is the least finicky of the 3 main outputs (the other two being PDF and Word).
  2. You can code straight HTML and CSS in your markdown document (javascript too), so you’re not limited by the built-in functions in markdown or knitr that render R code as HTML.
  3. HTML files are easy to share, and usually render well in the most common browsers.
  4. Allows you to host your files as GitPages (like this training website).
  5. Relatively easy to make 508 Compliant (but don’t ask me how to do it).
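As an example of the second pro, a raw HTML/CSS block like the one below can be dropped straight into the plain-text part of an .Rmd that outputs to HTML (the font sizes and color here are made up for illustration):

<style>
h1 {font-size: 28px; color: #333333;}
h2 {font-size: 22px;}
</style>

This restyles every level 1 and 2 header in the rendered document without touching the markdown itself.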

Cons of HTML output

  1. HTML files are hard to make print friendly. Page breaks aren’t an easy thing to add, and headers with a filled background (i.e., NPS banner) are hard to make print as they appear on the screen.
  2. If you share an HTML file with someone, they often have to download it first and then open it in their browser. The version MS Outlook or SharePoint shows when the file is opened within the app has less functionality than a web browser. If you have anything even slightly fancy (e.g., tabs), the view from attachments won't look the same as through a browser (or may just show the raw HTML, which is even worse!). This means you always have to include instructions for downloading and opening the file in a browser when sharing HTML files.

Output to PDF

Rendering to PDF requires a LaTeX engine that runs under the hood. The easiest engine to install is tinytex, and there are instructions and download files on the “Prep for Training” tab of this site.

Pros of PDF output

  1. PDFs are a more familiar file format and easy to share (i.e., can be viewed directly in MS Outlook/SharePoint).
  2. Enforcing page breaks and designing print friendly outputs is much easier than with HTML output.

Cons of PDF output

  1. LaTeX is a challenging, clunky, and not overly rewarding coding language to work with. There are decent websites offering help and examples, but it's nothing like the help community and resources for HTML (or R). I'm no expert on LaTeX, so perhaps it's just me, but I've spent hours fine-tuning a header in LaTeX that takes minutes to figure out in HTML, only to have it break after updating a LaTeX package. If your output needs to be PDF, pause to think whether you REALLY need it to be PDF. If PDF is still what you want, you'll likely need to use some (or a lot of) LaTeX along the way.
  2. It's much harder to make PDFs 508 Compliant using R Markdown. The functionality just isn't there, as far as I know. PDFs have to be modified after they're rendered to be 508 Compliant, which kind of defeats the purpose of automated reporting.

Output to Word

Rendering to Word can be a helpful way to generate a Results section for a report, or even the entire report for the first draft. It saves having to copy/paste figures and tables from R to Word, and makes life easier if you need to rerun your analysis or update a figure. You just update the .Rmd code and render again to Word. Admittedly, I rarely use this option because the functionality is much more limited than outputting to PDF or HTML (see the first Con below). See the Resources tab for links to websites and blogs that cover outputting to Word in more detail than we'll cover here.

Pros of Word output

  1. It’s easy to import a Word document with styles as a template, and then output your document to Word.
  2. Word documents are easy to share, and may be easier to collaborate on than an .Rmd file (e.g., track changes in Word).
  3. You can use a Reference Management System for bibliographies (although see bullet 3 in Cons).

Cons of Word output

  1. This is the most limited output, particularly if you just use R Markdown’s built-in output: word_document. You really can only apply the main styles in the template you imported. Figures and tables don’t always render all that well (or at all), and often need to be tweaked again in Word. Headers and footers don’t render either. To do anything beyond applying main styles, you’ll likely need to employ packages in the officeverse, including officedown and flextable packages. I’ve only spent a little time looking over these packages. They seem useful, but require learning another set of packages and conventions.
  2. Incorporating track changes back into R Markdown is a messy, manual process. It’s not something you’d want to do repeatedly for the same paper.
  3. I have yet to get a reference management system to work in Markdown. While it’s possible, it may be hard to implement on government furnished equipment using approved software.


Other outputs

There are dozens of other types of output, including FlexTables/Dashboards, slide decks for presentations, and posters for scientific meetings. There are also templates for journal articles for certain journals, and even templates for building a CV or resume. RStudio’s R Markdown page on Output Formats includes a list with links to a bunch of other output templates. Others can easily be found and downloaded from online sources.



Tables

Tables in R Markdown

There are quite a few packages that can help you make publication-quality, customized tables. The two I see used most frequently are kable() in the knitr package and datatable() in the DT package (not to be confused with the data.table package for data wrangling in R). The learning curve for kable is pretty shallow, and it generates HTML under the hood. The learning curve for DT is a bit steeper, and it has javascript under the hood. That means you can customize and add more features using those languages, if you know them. I tend to stick with kable, because I find HTML/CSS easier to code. If I need more bells and whistles, then I use datatable.

First, we’ll load a fake wetland dataset from our GitHub repo to make some summary tables using each package. The code below downloads the dataset from the training GitHub repo, and then summarizes the number of invasive and protected species per site. For both examples, the output format is HTML. If I were outputting to PDF, then I’d need to specify the format as ‘latex’ and use LaTeX code for any custom features not built into kable.

library(tidyverse)
wetdat <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
wetsum <- wetdat %>% group_by(Site_Name, Year) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop')
kable and kableExtra

The code below creates a simple table that renders in HTML, is only as wide as the records in the table, and has alternating row colors. If you’re outputting to PDF, your format will be ‘latex’ instead of ‘html’, and you’ll need to use LaTeX for any custom formatting/styling that isn’t built into kable and kableExtra.

Note also that the version of kableExtra on CRAN currently has a bug that causes collapse_rows() not to function. I’ll show what this does in a minute, but for now, just know that if you want to collapse rows in your kable, you’ll need to install the development version of kableExtra on GitHub. Code for that is below. You’ll need the devtools package installed to install it. If you’ve already loaded kableExtra in your session, you’ll also need to restart your session (Note: Ctrl + Shift + F10 is the fastest way to restart your R session).

devtools::install_github("haozhu233/kableExtra")
library(kableExtra) # for extra kable features
library(knitr) # for kable
wet_kable <- kable(wetsum, format = 'html') %>% # if using pdf, need LaTeX
             kable_styling(full_width = FALSE, bootstrap_options = 'striped') #kableExtra function
wet_kable
Site_Name Year num_inv num_prot
RAM-05 2012 0 3
RAM-05 2017 1 3
RAM-41 2012 0 0
RAM-41 2017 1 0
RAM-44 2012 1 0
RAM-44 2017 1 0
RAM-53 2012 2 0
RAM-53 2017 3 1
RAM-62 2012 0 0
RAM-62 2017 0 0
SEN-01 2011 0 1
SEN-02 2011 0 1
SEN-03 2011 0 0


Note the use of pipes in the code above. The great thing about kable and kableExtra is that you can pipe functions together to build out a large table with all kinds of formatting, including conditional formatting. You can also make a custom kable function that has all of the formatting options you want, and just specify the dataset to build the table for. You can then pipe more features onto that function. We’ll show a couple of these examples below.

# custom kable function that requires data, column names and caption
make_kable <- function(data, colnames = NA, caption = NA){
  kab <- kable(data, format = 'html', col.names = colnames, align = 'c', caption = caption) %>% 
      kable_styling(fixed_thead = TRUE, 
                    bootstrap_options = c('condensed', 'bordered', 'striped'), 
                    full_width = FALSE, 
                    position = 'left', 
                    font_size = 12) %>%
      row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>% 
      row_spec(nrow(data), extra_css = 'border-bottom: 1px solid #000000;')
  return(kab)
}

# use function with wetsum data
wetkab2 <- make_kable(wetsum, 
                      colnames = c("Site", "Year", "# Invasive", "# Protected"),
                      caption = "Table 1. Summary of wetland data") %>% 
           scroll_box(height = "250px")
Table 1: Table 1. Summary of wetland data
Site Year # Invasive # Protected
RAM-05 2012 0 3
RAM-05 2017 1 3
RAM-41 2012 0 0
RAM-41 2017 1 0
RAM-44 2012 1 0
RAM-44 2017 1 0
RAM-53 2012 2 0
RAM-53 2017 3 1
RAM-62 2012 0 0
RAM-62 2017 0 0
SEN-01 2011 0 1
SEN-02 2011 0 1
SEN-03 2011 0 0


There are a couple of new things to point out in the code:
  1. Because we set the arguments for colnames and caption to default to NA, you don’t have to specify them for the function. If you don’t, the column names in the table will be the names in the dataframe, and the caption will be omitted.
  2. We set the columns to be centered with align = 'c'. If you wanted the first column to be left, and the next 3 to be centered, you would write align = c('l', rep('c', 3)).
  3. The fixed_thead = TRUE means that if a scroll bar is added to the table, the table header (top row), will be fixed. Here we piped a scroll_box at the end of the code to show how that works. You can add a scroll box to the width of the page by adding a width = "###px" to the argument. Note also that if you add a scroll box, you’ll want that line of code to be last. Otherwise you’re likely to run into weird issues with kable that prevent the table from rendering. This is why I piped it at the end, instead of adding to the function.
  4. The position = ‘left’ left justifies the table on the page (default is center).
  5. The line starting row_spec(0, ) adds a black border to the top and bottom of the header, which kable considers row 0.
  6. The final row_spec(nrow(data)) is adding a black border to the bottom of the table regardless of the number of rows in the table.
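As a quick illustration of point 1, calling the function with only the data relies on those NA defaults (a sketch; output not shown, and wet_default is a made-up object name):

wet_default <- make_kable(wetsum)
wet_default

The table will use Site_Name, Year, num_inv, and num_prot as column headers and will have no caption.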

Finally, here are a few helpful options that render based on values in the table. Note that because we added a scroll box to wetkab2, we’ll start the code over, rather than add the features via pipe to wetkab2. The scroll box always needs to be the last call in your code. If it’s not, you’ll get a cryptic error message like Error in UseMethod("nodeset_apply").
  • Conditional formatting was applied to column 3 in the table (wetsum$num_inv). In the code below, the ifelse() that ends in FALSE is just allowing the default color to be printed instead of the conditional color. That allows the alternating row colors to remain.
  • Row collapsing was applied to column 1. In the rendered table you should see that sites repeated in the data are merged vertically in column 1. This is a really helpful feature for making tables more digestible. Note that once you collapse a row, all pipes afterwards subtract the collapsed row from the table dimensions, which can be tricky to deal with. In general, it’s best to place the collapse_rows() pipe after any column_spec() or row_spec() calls. You can also collapse on multiple columns, but it is finicky about the order of the pipes. Just use trial and error until you find the order that works. Note also that collapse_rows() only works in the development version of the package (see above for installation instructions).
wetkab3 <- make_kable(wetsum, 
                      colnames = c("Site", "Year", "# Invasive", "# Protected"),
                      caption = "Table 1. Summary of wetland data") %>% 
           row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>% 
           column_spec(3, background = ifelse(wetsum$num_inv > 0, "orange", FALSE)) %>% 
           collapse_rows(1, valign = 'top') 
Table 2: Table 1. Summary of wetland data
Site Year # Invasive # Protected
RAM-05 2012 0 3
2017 1 3
RAM-41 2012 0 0
2017 1 0
RAM-44 2012 1 0
2017 1 0
RAM-53 2012 2 0
2017 3 1
RAM-62 2012 0 0
2017 0 0
SEN-01 2011 0 1
SEN-02 2011 0 1
SEN-03 2011 0 0

Challenge: Change alignment

  1. Using the wetsum dataframe, how would you make a table where the first column is left-aligned and the remaining columns are center-aligned?
  2. How would you make the same table be centered on the page instead of left-aligned?
Answer
ans_tab <- kable(wetsum, format = 'html',
                 align = c('l', 'c', 'c', 'c')) %>%  # Answer 1
           kable_styling(fixed_thead = TRUE, 
                 full_width = FALSE, 
                 position = 'center') # Answer 2 

ans_tab
Site_Name Year num_inv num_prot
RAM-05 2012 0 3
RAM-05 2017 1 3
RAM-41 2012 0 0
RAM-41 2017 1 0
RAM-44 2012 1 0
RAM-44 2017 1 0
RAM-53 2012 2 0
RAM-53 2017 3 1
RAM-62 2012 0 0
RAM-62 2017 0 0
SEN-01 2011 0 1
SEN-02 2011 0 1
SEN-03 2011 0 0
DT::datatable

Using the same wetsum dataset we created earlier, we’ll make a table using datatable() and will add some of the features that kable() doesn’t have and that usually lead me to choose datatable over kable. We’ll start with a basic example and build on it.

library(DT)
wetdt <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"))
wetdt
The resulting table has several nice features that kable doesn’t offer.
  • The search box allows you to search the records in the text
  • Each column can be sorted using the arrows to the right of each column name
  • Vertical scroll bars are added by default, and only a subset of rows are displayed.

If you want to show more or fewer entries in your table at a time, you can specify different values by adding options and then setting either pageLength or lengthMenu. The pageLength option takes 1 value and displays that number of entries in the table. The lengthMenu option is similar, but allows you to supply multiple values as a list, which are then added to the dropdown menu in the Show [##] entries box. That lets the user select the number of entries they want to see at a time.

I also added an option that stops the table from spanning the entire page.

# modify pageLength and lengthMenu
wetdt2 <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"),
                    width = "40%",
                    options = list(pageLength = 10,
                                   lengthMenu = c(5, 10, 20))
                    )



Several additional features with datatable are shown in the code below.
  • CSS can be specified in the table. Here we added a cell-border stripe, which adds vertical lines between columns the same way bootstrap_options added striped cells in kable.
  • Filtering by one or more columns in the table can be really handy. This is often the reason I end up using datatable instead of kable. The code below adds a filter to the top of the columns in the table. This is a particularly useful feature if your tables have a lot of rows.
  • Allowing data to be editable is also potentially really useful. For example, you can add a blank Notes column that can then be updated and saved in the table. In the code below, I made the Notes column editable at the cell-level. Note that you have to double-click on the row to edit it. Just beware that adding notes to tables in R Markdown can cause some strange behaviors with tabs.
  • Piping the formatStyle at the end allows us to set conditional formatting like we did for kable, where any value > 0 in the 3rd column will be orange.
wetdt3 <- datatable(data.frame(wetsum, "Notes" = NA), 
                    width = "40%",
                    colnames = c("Site", "Year", "# Invasive", "# Protected", "Notes"),
                    options = list(pageLength = 10),                        
                    class = 'cell-border stripe',
                    filter = list(position = c('top'), clear = FALSE),
                    editable = list(target = 'cell', disable = list(columns = 1:4))) %>% 
          formatStyle(3, backgroundColor = styleInterval(0, c('white', "orange")))
wetdt3

Figures and Images

Inserting images in Markdown

There are multiple ways to display an image that’s stored on disk. The easiest is with markdown code in the plain text part of your document, which looks like:

![Map of Region-1 IMD parks.](./images/map_of_parks.jpg){width=400px}

Note that inserting a hyperlinked URL is very similar. Just omit the ! and put the URL in parentheses instead of the path to the image, like: [Link to IMD home page](https://www.nps.gov/im)

You can also use the HTML <img> tag. The code below will produce the exact same image as the markdown code above. I seem to remember the <img> tag better than the markdown version, so I tend to use it more often.

<img src="./images/map_of_parks.jpg" alt = "Map of Region-1 IMD parks" width="400px">

Map of Region-1 IMD parks.

I also like knitr’s include_graphics() function, as it can make iteration easier. For example, you can include a bunch of figures in a report based on a list of file names. Below I’m including all the photos in a photopoint folder, and making them only 25% of the width of the page. That puts them in a grid, which can be handy. You can also add breaks between each item in the list, and then they’ll plot separately. If an analysis has lots of plots that take a while to render, I tend to write the plots to disk and then bring them into the markdown document with include_graphics(). Using include_graphics() also means you’re running it in code chunks instead of the plain text, which allows you to dynamically number and reference figure names and specify figure captions at the same time. I’ll show that trick in a minute.

```{r photopoints, echo = T, out.width = "25%"}
photos <- list.files("./images/photopoints", full.names = TRUE)
include_graphics(photos)
```



Figure settings

Code chunks include several handy options for customizing figures. These include:

  • fig.align: defines how the figure will be justified on the page. Options are ‘center’, ‘left’, ‘right’.
  • fig.cap: adds a caption to the figure. Must be quoted.
  • fig.height & fig.width: sets the height and width of the figure in inches. Must be numeric and is not quoted.
  • out.height & out.width: sets the height and width of the plot in the opened file. In this case, you set the dimensions as percents, like out.width = "50%" to make the figure half the width of the page of the rendered document.

You can also set global options so that all figures default to a certain size or alignment. That way, you’d only need to specify figure options in the code chunk if you want to stray from your default settings. Global options can be set like:

knitr::opts_chunk$set(fig.height=4, fig.width=6, fig.align='left')
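A common pattern is to put that line in a chunk near the top of the .Rmd, often named setup, so the defaults apply to every chunk that follows (a sketch of that convention, using the same options as above):

```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.height = 4, fig.width = 6, fig.align = 'left')
```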

Dynamic numbering and cross referencing

Dynamic figure and table numbering and cross-referencing is one great feature of Markdown. The easiest way to add it is to output to bookdown’s html_document2, instead of rmarkdown’s html_document, which is what I’ve shown so far. You’ll need to add a few lines to the YAML code as well, which we show below. The number_sections: false line prevents bookdown from adding numbering to each section; if you like numbered sections, you can delete that line. You’ll also need to install bookdown (i.e., install.packages('bookdown')).

output: 
  bookdown::html_document2:
    number_sections: false
    fig_caption: true

To see how all of this works, we need to create a couple of plots. Sourcing the code below generates a fake dataset and then creates 2 plots. If this doesn’t work for some reason, you can copy and paste the code directly from the script named Generate_fake_invasive_data_and_plots.R in our IMD_R_Training_Advanced repository. I’m just trying to save room on the page by not printing code that isn’t important.

library(dplyr)
library(ggplot2)
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")

The code below prints the plots with their respective figure numbers. Note that code chunk names should be alpha-numeric, and can’t include spaces or underscores.

```{r fig-inv-all, fig.cap = "Trends in invasive plant cover in NETN parks.", out.width = "50%"}
invplot_all
```

Figure 1: Trends in invasive plant cover in NETN parks.

```{r fig-inv-acad, fig.cap = "Trends in invasive plant cover in ACAD.", out.width = "50%"}
invplot_ACAD
```

Figure 2: Trends in invasive plant cover in ACAD.

Notice that the figures are numbered consecutively in the order that they appear. For cross-referencing, each code chunk needs a unique name and a figure caption must be defined in the chunk options. To cross-reference, you then write \@ref(fig:chunk-name) in the plain text. The same works for tables too, but you use “tab” instead of “fig”. The following text:

As you can see in Figure \@ref(fig:fig-inv-acad), invasive cover appears to be declining. Whereas, invasive cover appears more stable in other NETN parks (Figure \@ref(fig:fig-inv-all)).

renders as:

As you can see in Figure 2, invasive cover appears to be declining. Whereas, invasive cover appears more stable in other NETN parks (Figure 1).



Parameters & Iteration

Parameters

One great feature of R Markdown that really helps with automated reporting is the use of parameters. Parameters are set in the YAML at the top, and are then used in the report to do things like filter data or print names. For example, in the code below, we set park, park_long, and cycle as parameters in the YAML. We then use params$park in the next code chunk to create a plot for just the park we specified, and print params$park_long in the caption for that figure. Then we make a table of the park and cycle we specified using params$park and params$cycle, and again use params$park_long in the table caption.

The YAML at the top will look like:
---
output: html_document
params:
  park: "ACAD"
  park_long: "Acadia National Park"
  cycle: 3
---

Inside the report, here’s some code to pull in the data and make some plots. Note the #### Use # of params comments. They mark where the parameters are specified in the code.

````

```{r imports, include = FALSE}
# Import data from github
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")
table(invdata$park, invdata$cycle) # Should see 3 parks each with 3 cycles

# Create invplot_fun to plot different params
invplot_fun <- function(data, park){
  p <- ggplot(data %>% filter(park == park), aes(x = cycle, y = inv_cover)) +
    geom_jitter(color = "#69A466", width = 0.1) +
    geom_boxplot(aes(group = cycle), width = 0.18, lwd = 0.4, alpha = 0) +
    theme_bw() +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
    labs(x = 'Cycle', y = "Invasive % Cover") +
    scale_x_continuous(limits = c(0.9, 3.1), breaks = c(1, 2, 3))
  return(p)
}

# Plot invasives by cycle for specified park
invplot <- invplot_fun(invdata, park = params$park) #### Use 1 of params

# Filter invasive data to only include specified park and cycle
parkinv <- invdata %>% filter(park == params$park) %>% 
  filter(cycle == params$cycle) %>% #### Use 2 of params
  select(-park)

parktab <- kable(parkinv, format = 'html', 
                 col.names = c("Plot", "Cycle", "% Invasive Cover", "% Canopy Cover"), 
                 caption = paste0("Table of forest plots in ", params$park_long) #### Use 3 of params
                 ) %>% 
  kable_styling(full_width = FALSE, position = 'left', bootstrap_options = 'striped') %>% 
  scroll_box(height = "400px", width = "420px")
```

````

Finally, I added some plain text below to show how to specify the parameters as inline code in text:

```

### Invasive summary example

The following data are summarized for `r params$park_long`

```


Then I print the plot and table using the following code chunks:

```{r plotinv, echo = FALSE, fig.cap = paste0("Invasive trends in plant cover in ", params$park_long), out.width = "50%"}
invplot
```

```{r parktable, echo = FALSE}
parktab
```


The code above will render the outputs below:

Invasive summary example

The following data are summarized for Acadia National Park:


Figure 3: Invasive trends in plant cover in Acadia National Park

Table 3: Table of forest plots in Acadia National Park
Plot Cycle % Invasive Cover % Canopy Cover
ACAD-01 3 1.2262568 81.88817
ACAD-02 3 0.2877510 56.05282
ACAD-03 3 4.5368474 76.32668
ACAD-04 3 5.0732908 67.99973
ACAD-05 3 0.2333573 76.22409
ACAD-06 3 0.0000000 88.09978
ACAD-07 3 0.0000000 88.81575
ACAD-08 3 2.9293494 53.97968
ACAD-09 3 0.0000000 75.43722
ACAD-10 3 0.0000000 66.81620
ACAD-11 3 2.5702070 64.26872
ACAD-12 3 0.0000000 97.12055
ACAD-13 3 0.0000000 73.33739
ACAD-14 3 0.0000000 58.74383
ACAD-15 3 0.0000000 75.38991
ACAD-16 3 3.3882025 69.33882
ACAD-17 3 1.1438505 60.82027
ACAD-18 3 1.7857986 64.73210
ACAD-19 3 1.8585776 78.08021
ACAD-20 3 1.0697659 75.02951
ACAD-21 3 1.6636475 67.65857
ACAD-22 3 3.1050853 70.33997
ACAD-23 3 2.3414571 82.94252
ACAD-24 3 0.3196272 73.83292
ACAD-25 3 0.0000000 67.27918
ACAD-26 3 1.4275352 72.12411
ACAD-27 3 3.5254140 62.02793
ACAD-28 3 0.2582069 81.31767
ACAD-29 3 0.0657203 80.97248
ACAD-30 3 2.1435462 54.95987
ACAD-31 3 1.8006971 79.54069
ACAD-32 3 0.0000000 66.28073
ACAD-33 3 0.6599651 71.10860
ACAD-34 3 2.5895022 55.35212
ACAD-35 3 0.0000000 70.15709
ACAD-36 3 0.2552997 72.18458
ACAD-37 3 0.4498149 70.17703
ACAD-38 3 0.0000000 75.88522
ACAD-39 3 0.0000000 89.46357
ACAD-40 3 0.7745124 62.74404
ACAD-41 3 0.7494048 98.09753
ACAD-42 3 0.0000000 88.88522
ACAD-43 3 0.0000000 71.95222
ACAD-44 3 0.0000000 81.87620
ACAD-45 3 3.3797115 74.91392
ACAD-46 3 2.9029968 81.61064
ACAD-47 3 0.3277854 71.95225
ACAD-48 3 0.4411102 65.64562
ACAD-49 3 0.0000000 69.21644
ACAD-50 3 0.0000000 81.55512


Challenge: Change parameters

  1. How would you change the park to Saratoga National Historical Park (code = SARA)?
Answer

Answer 1. Change the YAML to the following:

---
output: html_document
params:
  park: "SARA"
  park_long: "Saratoga National Historical Park"
  cycle: 3
---


Iterating with params

Taking this a step further, you can run a script that iterates through a list (e.g., a list of parks) and generates a report for each item in the list. I saved the code above, where we created a park-specific figure and table using parameters in the YAML, as example_for_params.Rmd. Now I’ll show the code to render an HTML file for each park in the dataset (ACAD, MABI, and SARA).

# load packages
library(purrr)
library(rmarkdown)

# Set in and out directories and file to iterate on
indir <- c("./Markdown_examples/")
outdir <- c("./Markdown_examples/output/") # Included, to see how to change out dir.
rmd <- "example_for_params.Rmd"

# Set up parameter lists. Have to repeat 3 times because each park has 3 cycles
park_list <- rep(c("ACAD", "MABI", "SARA"), each = 3)
park_long_list <- rep(c("Acadia National Park", 
                    "Marsh-Billings-Rockefeller NHP", 
                    "Saratoga National Historical Park"), each = 3)
cycle_list = rep(c(1, 2, 3), 3)

# Create the render function that we will iterate with
render_fun <- function(parkcode, parklong, cyclename){
    render(input = paste0(indir, rmd),
         params = list(park = parkcode,
                       park_long = parklong,
                       cycle = cyclename),
         output_file = paste0("Example_for_", 
                              parkcode,
                              "_cycle_", 
                              cyclename, 
                             ".html"),
         output_dir = outdir)
}

# Map render_fun() to the lists. 
# Note that you must have the same # of elements in each list
pmap(list(park_list, park_long_list, cycle_list), 
     ~render_fun(..1, ..2, ..3))


Bells and Whistles

I’m including several other tricks that have really helped with reports I’ve developed. If we don’t have time to cover them, they’re at least here for you to look over on your own time.

Conditional evaluation

Not only can you set eval = TRUE/FALSE or include = TRUE/FALSE, but you can set these conditionally using objects in your global environment. For example, I have an .Rmd report that runs weekly checks on our forest data during the field season, and it’s primarily looking for errors, like missing data or values that are out of range. I only want the report to print the results of a check if it finds something. The way I do this is below:

````

```{r QCcheck1, include = FALSE}
fake_check1 <- data.frame(results = c(1:5))
#fake_check1 <- data.frame()
eval_check1 <- ifelse(nrow(fake_check1) > 0, TRUE, FALSE)
```

```{r, check1, eval = eval_check1}
print(nrow(fake_check1))
```

````

Dynamic Tabsets

Another trick that has come in handy (although it’s not the easiest thing to implement) is to dynamically generate tabsets using lists. For example, we have one water summary .Rmd that takes park as a parameter. Using the park, the code figures out which sites occur in that park, and populates the site tabs specific to that park for the report. Here’s a simplified version of that approach below.

The first code chunk performs the following steps:
  1. Created a tabset_data dataset with 3 parks. Each park has a different number of sites, and each site has been sampled every year from 2010 through 2021.
  2. Used tabset_data to generate a unique list of parks (park_list) and a list of sites (site_list).
  3. Created a ggplot function called ggplot_fun that plots the data for a single site, so we can iterate over site_list.
  4. Created a list containing ggplot objects called plot_list using purrr::map(). Note also that set_names() names each element of plot_list after its site, so plots can be looked up by site name later.

````

```{r ts_lists, echo = F, results = 'hide'}
library(purrr)
library(ggplot2)

# Create dataframe with 12 years of data for each plot and number of species found
years <- 2010:2021
numACAD <- 10
numMABI <- 5
numSARA <- 7
tot_rows <- length(years) * (numACAD + numMABI + numSARA)

tabset_data <- data.frame(
  site = c(paste0(rep("ACAD-", numACAD * length(years)),
                  sprintf("%02d", rep(1:numACAD, each = length(years)))),
           paste0(rep("MABI-", numMABI * length(years)),
                  sprintf("%02d", rep(1:numMABI, each = length(years)))),
           paste0(rep("SARA-", numSARA * length(years)),
                  sprintf("%02d", rep(1:numSARA, each = length(years))))),
  year = rep(years, numACAD + numMABI + numSARA),
  numspp = as.integer(runif(tot_rows, 1, 20))
)
tabset_data$park <- substr(tabset_data$site, 1, 4)
head(tabset_data)

# Create park and site lists for the for loop below
park_list <- unique(tabset_data$park)
site_list <- unique(tabset_data$site)

# ggplot function to implement in for loop
ggplot_fun <- function(site_name){
  dat <- dplyr::filter(tabset_data, site == site_name)
  p <- ggplot(dat, aes(x = year, y = numspp)) +
    geom_point() +
    geom_line() +
    labs(x = "Year", y = "Species richness") +
    theme_classic()
}

# Create named list of ggplots
plot_list <- set_names(map(site_list, ~ggplot_fun(.x)), site_list)
```

````
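As an aside, set_names() is doing important work in the last line of the chunk above: it names each element of the list after its site, so plots can later be retrieved by name. A minimal standalone sketch of the pattern (toy data, not from the report):

```r
library(purrr)

# Build a list with map(), then name its elements so they can be looked up by name
squares <- set_names(map(1:3, ~ .x^2), c("one", "two", "three"))

squares$two  # returns 4
```

Without the set_names() call, the list elements would only be accessible by position, which makes the dynamic tab code much harder to write.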

The next code chunk pulls the data together and renders the park and site tabs. The steps are described below:
  1. The cat("## Dynamic Tab Example {.tabset}", "\n") line sets up tabsets for the next header level (###).
  2. The first for loop (for(i in seq_along(park_list))) iterates over the parks. For each park, a tabset with the park code is rendered via cat("### ", park, "{.tabset}", "\n").
  3. Within the park for loop, we create a park-specific list of sites, which we name park_site_list.
  4. We add a second for loop to iterate through park_site_list.
  5. Next we have a long paste0() that builds out the rest of the report for each element in park_site_list via site_template.
    1. We create a level 4 tab with the site name via "#### {{site}} \n\n".
    2. We then print a level 5 header within that tab via "##### Summary for {{site}}", "\n\n".
    3. Then there’s a code chunk, starting with ```{r and ending with "```\n\n", that prints the site’s element of the plot_list we set up in the previous chunk.
  6. The last two steps use knitr::knit_expand() and knitr::knit() to expand, evaluate, and print all of the text in the for loop.

Note the liberal use of line breaks via \n to make this all work. Tabsets need line breaks between themselves and anything that follows to render as a tab. Also, don’t ask me why sometimes I used {{}} and sometimes I used []. It takes trial and error to make this thing work!

````

```{r net_tabs, echo = F, results = 'asis'}
cat("## Dynamic Tab Example {.tabset}", "\n")

for(i in seq_along(park_list)){
  park <- park_list[i]
  cat("### ", park, "{.tabset}", "\n")
  park_site_list = sort(unique(tabset_data$site[tabset_data$park == park]))

  for(j in seq_along(park_site_list)){
    site = park_site_list[j]
    site_template <-
      paste0("#### {{site}} \n\n",
             "##### Summary for {{site}}", "\n\n",
             "```{r ", site, "_plot, echo = F, fig.cap = ('Trends in species richness in {{site}}'), out.width = '40%'}\n\n",
             "print(plot_list[site])", "\n\n",
             "```\n\n")
    site_tabs <- knitr::knit_expand(text = site_template)
    cat(knitr::knit(text = unlist(site_tabs), quiet = TRUE))
  }
}
```

````


Viewing source code

HTML

If you’re rendering to HTML, you can view the actual HTML and CSS under the hood by right clicking on the document in your browser and either inspecting or viewing source code (actual terms differ across browsers). R Markdown makes a lot of default assumptions about the container and divisions and has a lot of built-in CSS that can sometimes be hard to change. Viewing the source code can help you figure out where things are going wrong. If it’s a specific tag or class in a tag, you can redefine those tags or add new classes in your custom CSS.

LaTeX

If you’re rendering to PDF or MS Word, you can save the template used to generate the report by adding the code below to the YAML:
---
output: 
   pdf_document: 
      keep_tex: yes
---

Resources

R Markdown
  • R Markdown: The Definitive Guide. This is Yihui Xie’s book, available free online and for purchase. Yihui is one of the main developers at RStudio working on R Markdown and related packages (e.g. knitr, pagedown, bookdown). The online book is searchable and is often the first place I check when I can’t remember how to do something.
  • R Markdown Cookbook. Another great book with Yihui Xie as an author. This book is free online and has a lot of small bite-size tips for customizing R Markdown.
  • R for Data Science, chapters 27 and 29. The book itself is worth a read cover to cover. The chapters on R Markdown are also very helpful on their own.
  • RStudio’s R Markdown page: Includes several short videos and lots of tutorials, articles, and a gallery to give an idea of the many things you can do.
  • R Markdown Cheat Sheet.
  • kableExtra vignette: Lots of great examples of the different styling/formats you can use with kables and the kableExtra package.
  • R Markdown for Scientists is an online book with a lot of helpful tips for people like us.
HTML/CSS
  • W3 Schools HTML page: This website includes helpful tutorials on HTML and CSS, includes executable examples for just about every HTML tag you can think of, and shows how different browsers render content.
  • HTML & CSS design and build websites: This book costs about $15 (cheaper if you can find a used copy) and was a very helpful introduction and continual resource, particularly for working with CSS. There’s a Javascript & JQuery book by the same author that’s equally well-done, but was a much steeper learning curve.
LaTeX
  • CTAN.org: This is LaTeX’s version of CRAN.
  • Overleaf Online LaTex Editor: Includes a short guide to learn LaTeX, examples for most common uses of LaTeX, and has a built in editor that you can execute code in.
MS Office

Day 4: Version Control

Intro to version control

Background

As you become more familiar with coding in R, your code will become longer and more complex. You will have projects that you revisit and update each year. You may wish to share your code with others and allow them to make suggestions and contributions.

If your code was a Word document, this is the point where you would turn on Track Changes. Think of version control as Track Changes for your code. Version control is much more sophisticated and flexible, which means that it comes with a steeper learning curve. If you come away from this class feeling like you have no idea what you are doing, don’t despair! Embrace the learning curve and remember that you don’t have to be an expert in version control to take advantage of its most useful features.

Why version control…?

why version control

…because we’ve all been here before. Version control is worth the learning curve because:

  • It lets you take snapshots of your progress, creating a full long-term change history for each file in your project
  • It makes it easier to keep track of the latest working version of your code
  • It makes it easier to merge your changes with changes that your collaborators have made
  • It makes it easier to identify and recover from mistakes
Requirements

There are a variety of version control systems available. We will be using Git, since it is by far the most commonly used system and it is free and open source.

interest in version control systems

You’ll need the following installed and ready to use prior to this course:

  • R 4.0 or higher and a recent version of RStudio (the more recent, the better)
  • The latest version of Git for Windows

Getting started with Git

Workflow
Git vs. GitHub

Understandably, lots of people confuse Git and GitHub or use the terms interchangeably. They’re two separate (but related) things though. Git is the version control software that you install on your computer. It runs on your local hard drive and does all the heavy lifting when it comes to tracking changes to your code. GitHub is an online service for storing and collaborating on code. It has a nice user interface, and it is great for facilitating collaboration and project management. Even if you’re working on your own, it’s a great place to back up your local code. Git can be configured to sync your code to and from GitHub when you tell it to.

Git vocabulary

There are a few fundamental concepts that you should keep in mind as you are learning Git. It’s ok if they don’t fully make sense right now - we will keep revisiting them as we practice.

  • Repository
    • You will hear this term a lot. A Git repository is a hidden folder inside of your project that tracks the change history of your project. If you have hidden folders set to visible in File Explorer, it looks like this: git folder. Deleting this folder will delete your entire change history, so don’t mess with it!
  • Stage
    • Unlike Track Changes in Word, Git does not automatically add your latest edits to its change history. Creating an entry in Git’s change history is a two-step process, and staging is the first step.
    • Think of staging your changes as composing a snapshot.
  • Commit
    • Committing is the second and final step of saving an entry to Git’s change history.
    • Think of committing as pressing the shutter button to take a snapshot.
  • Branch
    • Have you ever made a copy of a file so that you could make your own edits without affecting the main working version? Branching in Git allows you to safely edit and experiment without having to manage multiple copies of the same file(s).
    • Terminology note: sometimes you will see the default branch referred to as “main”, and other times it will be called “master”. Current best practice is to use “main”.
  • Merge
    • At one point or another, most of us have had to piece multiple people’s Powerpoint presentations into a single slide deck. Git does the same thing with code, but it’s usually smart enough to do it automatically! No copy-pasting necessary.
  • Push
    • This is the Git equivalent of copying your work from your local hard drive to a shared folder. For our purposes, we will always be pushing our work to a repository on GitHub.
  • Pull
    • This is the Git equivalent of copying the latest version of a file from a shared folder to your local hard drive. For our purposes, we will always be pulling from a repository on GitHub.
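At the command line (covered next), the vocabulary above maps onto a handful of commands. A minimal sketch, with hypothetical file and branch names:

```shell
git add birds.R                       # Stage: compose the snapshot
git commit -m "Add bird checklist"    # Commit: press the shutter button
git switch -c birds-eda               # Branch: create and switch to a new branch
git push origin birds-eda             # Push: send your commits to GitHub
git pull origin main                  # Pull: fetch and merge the latest main
```

The push and pull commands assume you have already connected a GitHub remote named origin, which we’ll do later in this lesson.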
The command line

Git was originally developed to be used at the command line. You can still do everything in Git at the command line if you want to, but most people prefer a point-and-click user interface (often referred to as a Git client) for most tasks. For the purpose of this class, we’ll do a few things at the command line, but we’ll focus on using the RStudio Git client when possible. As you get more comfortable with Git, you may find it helpful to practice using Git at the command line in order to understand what’s going on under the hood. If not, that’s completely fine! If you want nothing to do with the command line, you can install a more fully-featured Git client. Search the list of approved software for “Git” to see which Git client(s) are available.

Git and RStudio

You can use Git completely independently from RStudio if you want to, but for the purposes of this class we’ll use RStudio’s built-in Git client. Without it, you have to type Git commands in the terminal. You can download other Git clients that do more than the one in RStudio, but we’ll stick with RStudio in this class since it’s simple and convenient.

First-time setup

Open RStudio. If you have a project open, close it out by going to File > Close Project. Look in the top right pane - at this point, the Git tab will not be present.

no git tab

Go to Tools > Global Options. In the list on the left, select Git/SVN. Make sure that the Enable version control interface for RStudio projects box is checked and verify that everything looks like the screenshot below (your git.exe path may differ, that’s ok).

git options

Click Apply, then OK. Now RStudio should be set up to use Git.

Now you need to connect your GitHub account. The easiest way to do this is to use the usethis package.

Step one: create a personal access token

GitHub uses personal access tokens for authentication. Instead of entering your GitHub username and password into RStudio, you’ll create an access token and store that. A personal access token is more secure because you can revoke it at any time, you can set it to expire after a fixed amount of time, and it doesn’t provide full access to your account. However, you should treat it like a password: don’t share it with others, don’t hard-code it into a script, etc. If you must save it somewhere (not necessary), use a password manager like KeePass.

# This opens up a browser window where you can create a personal access token. It may ask you to log into GitHub or confirm your password.
usethis::create_github_token()

Once you are logged into GitHub, you should see a page that looks like this:

github pat creation

Put something descriptive in the Note field, like “RStudio”. Next, look at the Expiration dropdown. By default, tokens are set to expire after 30 days, at which point you will have to repeat this process. You have the option to set the expiration date to later or never. Use your best judgment. You can leave the scope settings as-is. Scroll to the bottom and click Generate token. The only time that GitHub will show you your personal access token is when you create it. So make sure you copy it to your clipboard so that you can finish connecting to GitHub from RStudio!

github pat

Step two: insert your personal access token in the Git credential store

Run the code below and when prompted, paste in your personal access token.

gitcreds::gitcreds_set()

If you already have Git credentials stored, gitcreds_set() will prompt you to keep or replace them. Unless you have recently gone through the process of creating a personal access token, go ahead and allow it to replace them with your new access token.

Creating a new project

The easiest way to create a new RStudio project with a Git repository is to go to File > New Project > New Directory > New Project and make sure to check the Create a git repository box before you click the Create Project button.

create git project

Alternately, you can use the usethis::create_project() and usethis::use_git() functions to initialize a project with a Git repository. usethis::use_git() is also the easiest way to add a Git repository to an existing project.

At this point you should have an empty project open in RStudio, and you should see a Git tab in the top right panel. Note that everything you have created so far exists only on your local hard drive - we haven’t done anything with GitHub yet.

Click on the Git tab. If you used the create package wizard, you’ll see two files with yellow question marks next to them.

initial git tab

The question marks mean that Git recognizes that you’ve created new files, but it is not tracking changes to them yet. It seems like it should start tracking changes by default, like a word processor, but we don’t necessarily want Git to track every file we create in our project folder. Remember that eventually we’re going to put our repository on GitHub, in the cloud. GitHub has privacy settings, but even so, we don’t want to store things like sensitive or unpublished data there. Information for connecting to a database is another example of something we might store in our project folder but wouldn’t want to publish to GitHub.

Before we do anything else, let’s make our first commit. Remember, think of a commit as a snapshot of our project at a given point in time. In the Git tab, click on the Commit button. commit button

initial commit

There’s not much to see here yet. Note the Staged column in the file list. In order to stage a file for commit, click the checkbox next to it. Go ahead and stage both files. Note the green A next to each file: this means that you’ve told Git to include the contents of the file in the next commit. A common misconception is that checking the Staged box actually creates a new commit. That’s not the case, though! You’ve composed your snapshot, but you haven’t pressed the shutter button.

To actually commit the staged files, enter something brief but descriptive in the Commit message box (“Initial commit” is typically used for the first commit). Then click on the Commit button. Now you’ve taken your snapshot! Git will give you some feedback, letting you know that your commit includes 17 new lines of code in two files, and two new files were created: .gitignore and git-practice.Rproj.

commit feedback

Click Close.

Creating a GitHub repository

At this point, it’s time to create a home for your project on GitHub. Even if you never share your GitHub repository with anyone else, it’s a great place to back up your code. Log into GitHub and click the + in the top right corner. Select New repository.

new github repo

Leave Repository template set to No template. Give your repository a name. Ideally this should match the name of your project, for your own sanity, but it doesn’t have to. Go ahead and include a very brief description of your project. For the purpose of this class, set your repository to Public. In general, use your best judgment when it comes to public vs. private repositories.

Things to consider: how easily do you want to share your code with others? Can your code benefit others? Is there a risk of accidentally committing sensitive information to the repository? Is there any sensitive information in your code? You should never ever hard-code things like login information, but something like code pertaining to sensitive species in a park might be worth keeping private.

In the Initialize this repository with: section, leave all the boxes unchecked. You already have a .gitignore file and you’ll create a README later. Don’t worry about a license for this code. Click Create Repository and copy the URL to your repository as shown below.

fresh github repo

Connecting a local Git repository to GitHub

Now you have a local repository on your hard drive and a repository in GitHub (often referred to as a remote repository). Now it’s time to make them talk to each other. From the terminal tab in RStudio (still in your git-practice project), follow the “push an existing repository from the command line” instructions shown in your GitHub repository. Make sure to replace the URL with the URL to your own GitHub repository!

If you don’t see a terminal tab next to your console tab, go to Tools > Terminal > New Terminal.

connect to gh repo

If this is successful, you’ll see several lines of feedback in the terminal, ending in “Branch ‘main’ set up to track remote branch ‘main’ from ‘origin’.” Verify that everything worked correctly by refreshing the webpage for your GitHub repository.

successful gh initial push

Finally, look at the Git tab. The green and blue arrow buttons should no longer be grayed out and there should be no files listed as changed or added.

Other setup scenarios

Existing project without local Git repo

If you have an existing project but you didn’t check the Create a git repository box when you created it, don’t despair. Make sure your project is open, then go to the Terminal tab, type git init, and hit Enter. Now you can create and connect a GitHub repository as described above.

New project from existing GitHub repo

This is a very common workflow. This is what you will do if someone shares a GitHub repository with you and asks you to contribute to it. You can also do this if you want to create a new GitHub repository first, then a new project. In GitHub, copy the repository URL:

copy repo url

Next, open RStudio and go to File > New Project. In the popup, select Version Control, then Git. Under Repository URL, paste the URL that you just copied from GitHub. Project directory name should default to the repository name. If not, just type it in. For Create project as subdirectory of:, select whichever folder you want to store your R projects in. Click Create Project. You’re done! You now have an R project containing the code from the GitHub repository. It’s already set up with a local repository and a connection to the GitHub repository. Make sure the owner of the GitHub repository has set the permissions so that you can push to it.

project from git

This is known as cloning a repository - you are copying the whole GitHub repository to your local hard drive, and setting up your new local repository to sync with the GitHub repository when you ask it to.

Working by yourself

Version control is essential to effective collaboration on code projects, but it’s also an incredibly useful tool when you’re working by yourself. It’s also a great way to get more familiar with the version control workflow without the added complexity of working with collaborators.

Overview diagram

Branching

If you are adding a new feature to your code, it’s good practice to create a new branch to work from. Examples of things you might create a branch for are a new analysis, a complex map or plot, or a few simple maps or plots. You might also create a new branch to fix or improve existing functionality.

Ideally, you should not work from the main branch. Instead, reserve it for the most recent version of your project that is known to work. Later in this lesson you’ll learn how to incorporate code from another branch back into the main branch.

To create a new branch from RStudio, go to the Git tab and click on the branch branch icon icon. Pick a concise but descriptive name for your new branch. For this class, call it birds-eda since we’ll be doing some exploratory data analysis on a bird checklist.

Tip: it’s common to format branch names as all lowercase with words separated by dashes.

Leave the Remote dropdown set to “origin” and leave the Sync branch with remote box checked. This just means to use the GitHub repository we set up in the last lesson and automatically create a new branch there to match our local one. If you uncheck the box, you will not be able to push your branch to GitHub.

Click Create. Look at the Git tab. Your current branch (top left corner) should be the branch you just created. Click on the dropdown to see all of your branches. LOCAL BRANCHES refers to the branches in the repository on your local hard drive. REMOTE:ORIGIN refers to the branches in the connected GitHub repository. You can more or less ignore the stuff under REMOTE:ORIGIN and just focus on your local branches. If you do click on one of the REMOTE:ORIGIN branches, Git will try to sync any new changes from that GitHub branch to your corresponding local branch.

branches dropdown

Staging and committing

As you saw when you created a new repository in GitHub, tracking changes in Git is a manual process. When you create a new file, you must tell Git to begin tracking it. Untracked files are designated by a yellow question mark untracked file.

To practice, go ahead and create a new R script. Paste in the following code and save it as birds.R.

# Read in bird checklist
birds <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")

# Preview dataset
head(birds)

Time to stage and commit the new script. Click on the Commit button in the Git tab. Check the box in the Staged column. Notice that the code you just added is highlighted below in green. That’s Git’s way of indicating lines that were added since the last commit. birds.R is a new file, so everything is green. Add a commit message and click Commit.

birds.R commit number one

Notice the message in the Git tab: branch ahead of origin. This means that you now have one commit in your local repository that doesn’t exist in your GitHub repository. We’ll make one more commit, and then we’ll get in sync with GitHub.

We created this branch for exploratory data analysis, so let’s do some. We’ll change our current code to use the Tidyverse, then explore the dataset a little. Replace the contents of birds.R with the following:

# Load packages
library(dplyr)
library(readr)

# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")

# Preview dataset
birds

# Verify that all species names are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")

# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")

# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))

# Count species in each species group
group_size <- birds %>%
  group_by(species_group) %>%
  summarize(species_count = n()) %>%
  arrange(-species_count) %>%
  ungroup()

# What are the largest species groups?
group_size

# What are the smallest species groups?
tail(group_size)

You know the drill at this point. Save the file and open the commit window (Git tab > Commit), but don’t check the box in the Staged column yet.

Notice the lines in red. Git parses a change to a line of code as the deletion of the original line and the addition of a new line with the change applied. Lines with no red or green highlighting are lines that haven’t changed. In longer files, Git will hide unchanged lines, so don’t panic if you see this - your code hasn’t disappeared.

Now we’re going to see why the two-part process of staging and committing is so useful. We did two things when we updated birds.R - we updated the data import code, and then we did some exploratory data analysis (EDA). It would be nice for our commit history to reflect those distinct changes. Click on line 1 (# Load packages), hold down the Shift key, and click on line 9 (birds) to highlight the chunk of data import code that we changed. It should be selected in blue. At the top of the selection, click on the Stage selection button.

stage selection

At the top of the code preview, toggle Show from Unstaged to Staged. Instead of staging all of our edits to birds.R, we only staged the edits we highlighted. Go ahead and commit those staged changes.

Since we didn’t commit everything, our EDA code still shows up as unstaged. Stage it now and commit it, but don’t close out the Review Changes window.

commit selection

In the top left of the Review Changes window, click on History. Here you can review your project’s commit history. Each row is a commit. Since we wrote helpful commit messages, it’s easy to get a sense of what has been done.

Right now we’re just looking at the history for the birds-eda branch, but you can use the dropdown next to the History button to view history for other branches or for the whole project.

commit history

The colored labels indicate branches. HEAD -> refs/heads/birds-eda means you are currently on the birds-eda branch. If we switched to the main branch, it would then be labeled HEAD -> refs/heads/main and the tracked files in our project folder would change to reflect the state of the main branch. Branches whose labels start with origin/ are the branches in your GitHub repository.

You can click on a commit to see what changes are part of that commit and which files were involved. The alphanumeric nonsense labeled SHA is just a unique identifier for that commit.

(If you’re interested, SHA stands for Secure Hash Algorithm. The contents of the commit get fed into an algorithm that generates a checksum, a unique identifier for that commit.)
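You can also browse these identifiers from the terminal instead of the Review Changes window; for example:

```shell
# Abbreviated SHA and commit message, one line per commit
git log --oneline

# Full details (author, date, full SHA) for the most recent commit only
git log -1
```

This is handy when you just want to confirm what your last commit was without opening the History view.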

Pushing to GitHub

Now that you’ve made a few commits, you should push your birds-eda branch to GitHub. That way, it’s backed up in case of hard drive failure. It’ll also be easier to share the code with other people. Verify that you don’t have any uncommitted changes, then use the green Push arrow push arrow to push birds-eda to GitHub. You can find this button in the Git tab or in the Review Changes (commit) window. Both buttons do the same thing. You’ll get some feedback that the branch was pushed; you can close that window once you verify that there are no error messages.

Note: Pushing a branch to GitHub only pushes commits made to that branch. If you have uncommitted work or work in other branches, it will not end up in GitHub.

Merging

In a typical project, you will repeat the process of staging, committing, and pushing a few times before your branch is ready to become part of the working, authoritative version of the project that is the main branch. You’ll also want to make sure to test the code in your branch thoroughly to make sure that it works!

For the purposes of our demo project, we’ll say that we’re done with EDA and birds-eda is ready to be included in main. To do this, we merge birds-eda into main.

You must be on a branch (main) to merge another branch (birds-eda) into it. In the Git tab, click on the branch dropdown (it should currently say birds-eda), and switch to your local main branch. You will now see a popup that strikes fear into the hearts of the uninitiated:

commit history

Don’t panic. When you switch to a different branch, the files in your project directory will update to reflect the series of commits that are part of that branch. There’s only one commit on main - the initial one. You can verify this by clicking the History button commit history button in the Git tab. birds.R doesn’t exist in the main branch yet, but if you switch back to birds-eda it will return.

Now, we get to feel like real hackers for a minute. Navigate to the Terminal tab in the bottom right pane of RStudio. Verify that you are on the main branch, then type the below command. Pay attention to spacing, punctuation, and capitalization - typos are the most common reason that Git commands fail.

git merge birds-eda

Hit Enter. You should see some feedback in the terminal:

merge feedback

You can verify that the merge was successful by looking in the Files tab - you should now see birds.R.

The last step is to get the main branch in GitHub back in sync with the local main branch. Same as you did with birds-eda earlier, we’ll push our local branch to GitHub by clicking on the green Push arrow.

If you look at your commit history, you can see that all branches are now up to date with birds-eda.

all branch commit history

Congratulations! Now you have the skills you need to start using version control for your code projects.

Guidelines

As you practice this workflow, here are some guidelines to keep in mind:

  • Commit (snapshot) your work often: Small, specific commits make it easier to undo mistakes later
  • Write useful commit messages: Be brief but descriptive; resist the urge to save time with generic messages
  • Push to GitHub often: If your hard drive fails, this is your backup
  • The main branch should work: Don’t merge code into the main branch if you know it’s broken (but if it happens, don’t panic - just fix the bug on a new branch and merge back into main)

Working with others

Guidelines: the same as when working by yourself, plus: communicate with your collaborators, and don’t delete or modify commits once you have pushed them (more on this later).

Overview diagram

Branching - same as before.

Both: Make a new branch and switch to it

Staging and committing - same as before.

Each: Make a different file Each: Stage and commit Each: Push branch

Ignoring files in Git

You’ll usually have files in your project that you don’t want to include in its permanent history. This could include things like data, database connection information, or user-specific configuration files. Depending on the situation, you also may want to avoid tracking files generated by your code, since you can just recreate an up-to-date version of these files by rerunning your code.

This is where the .gitignore file comes in handy. Open it up, and on a new line add the name of the file or folder you want Git to ignore. You can also use a glob (wildcard) pattern, like *.csv, to match multiple filenames. The change takes effect when you save .gitignore, but you may need to click the refresh icon in the top right of the Git tab to see the files disappear from the status view.
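For example, a hypothetical .gitignore might contain entries like these (the file and folder names are placeholders - note that a # only starts a comment at the beginning of a line):

```
# ignore a single file
config_secrets.R
# ignore an entire folder
data/
# ignore every file matching a pattern
*.csv
```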

git ignore

WARNING: If you accidentally commit sensitive information, deleting it and/or adding the file to .gitignore is not enough! The information will still exist in your commit history. See here for information on how to handle this situation.

Pulling part I

Pull = fetch (get changes) + merge (incorporate them with your stuff). When you are on a branch, pulling retrieves the remote (GitHub) version of that branch and merges it into your local version.

Collab 1: Pull main, then pull your own branch in case Collab 2 made any changes.

Pushing to GitHub - collaborative version

Don’t push main to a shared GitHub repo!

Push your branch + create pull request in GitHub

Pull requests

NOTE: pull request != pull. You are asking your collaborator to pull your branch, review it, and then merge it into main if all looks good. Why isn’t it called a merge request? Because sometimes you just want feedback on your code and aren’t ready to merge it into main.

Collab 2: Review pull request, merge into main

Pulling part II

Collab 2: Now main has new commits that we don’t have locally. Make sure that everything works with our code by merging from main, then push our branch and do a pull request. Collab 1 will review and merge into main.

Rinse and repeat

Troubleshooting

As you learn Git, you will inevitably end up in a tight spot here and there. DO NOT PANIC. It is very hard to permanently lose your work in Git. It is very easy to get confused and think that you did. We’ll talk through some common issues and how to fix them, then talk about what to avoid when you’re starting out.

Branch won’t push

Chances are it’s because there are commits in the remote branch that aren’t on your local branch. While pulling a branch will do an automatic merge, pushing it will not. You need to pull the remote branch, then push your local branch.

I don’t see my commit in my local commit history

Are you on the right branch? If not, switch branches. If you’re on the right branch, it’s because you didn’t commit. A common mistake is to stage changes and forget to hit the commit button. Remember, staging is like framing a snapshot and committing is pressing the shutter button.

I don’t see my latest changes in GitHub

Are you looking at the right branch in GitHub? If not, switch branches. If you’re on the right branch: did you remember to push? If you remembered to push: you probably have uncommitted changes (see above). Commit them and push again.

I can’t pull/merge/switch branches

You probably have changes that haven’t been committed. When you pull or merge, Git combines your commit history with the commit history of the branch you are pulling or merging (remember, pull is just fetch remote branch + merge it). Then it actually changes your working directory files to reflect the new state of the branch. If all your changes have been committed, that’s not a problem. If you have uncommitted changes, they would get overwritten if Git didn’t throw an error instead.

When you switch branches, Git does a similar thing - it changes the contents of your working directory to reflect the current state of that branch. If your changes aren’t committed, they would be overwritten.

Just commit anything with a blue M next to it and try again. Alternately, do git stash, do the pull/merge/switch, then git stash pop.

AAAAAAAA MERGE CONFLICT EVERYTHING IS BROKEN AND NOTHING WORKS

Take a deep breath. This is confusing at first but it’s not so bad. Merge conflicts happen when you and another person make different changes to the same line of code. They can usually be avoided by communicating with your collaborators and making sure you’re not working on the same thing. They still happen though, and the fix is simple: just edit the offending file to look like it should. Find the markers indicating the conflict, then keep the version you want to keep (or keep both, or make some edits). Then save, stage, and commit.
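For example, if both branches edited the same line of a hypothetical birds.R, the conflicted section would look something like this:

```
<<<<<<< HEAD
mean_count <- mean(birds$count)
=======
mean_count <- mean(birds$count, na.rm = TRUE)
>>>>>>> birds-eda
```

The lines between <<<<<<< HEAD and ======= are your version; the lines between ======= and >>>>>>> are the other branch’s version. Delete the three marker lines, keep the code you want, then save, stage, and commit.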

My push/pull buttons are grayed out in RStudio!

This usually means your local branch isn’t tracking a branch on GitHub yet. Push the branch once from the Terminal with git push -u origin your-branch-name (substituting your branch name), and the buttons should activate.

My code is broken!

For simple mistakes, just fix it and commit. If a more complex change broke your code and you can’t fix it, you can revert the offending commit (this is a good reason to be thoughtful about how you organize and comment your commits).
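In the Terminal, a revert looks like this (the hash is a placeholder - copy the real one for the offending commit from the History view):

```
git revert a1b2c3d
```

Note that git revert doesn’t delete the bad commit; it creates a new commit that undoes its changes, which keeps the history intact for your collaborators.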

Things to avoid

  • DO NOT DELETE YOUR .git FOLDER.
  • Don’t use git reset when collaborating (unless you know what you’re doing). This is a good way to create a headache for you and your collaborators.

Day 4: R Packages

Intro to R Packages

Background

Once you’ve moved on from writing long stand-alone R scripts to writing custom functions and iterating over your workflows, the next logical step is to start building your own custom package to more easily apply, document, and share your code.

Reasons to build an R package

  • Organize and document your code
  • One place to update your code
  • Easier to share across projects and with others
  • Using version control and packages together allows you to build and test new features without breaking the package for others.
Note: Not everything needs to be a package! If you’re scripting a one-time analysis that you’re never going to use again, it doesn’t need to be a package. However, given that we’re in the game of long-term monitoring, and we’re often performing the same tasks with our data over and over, custom R packages are a powerful tool that, if well built, will add enormous efficiency to your workflow.


Requirements Before you can start building packages, you’ll need the following installed and ready to use:
  • R 4.0 or higher and a recent version of RStudio. The more recent the version, the better, usually.
  • Rtools for R 4.x (hopefully you installed this prior to this session).
  • Packages: devtools, roxygen2. Both packages are actively being developed. To take advantage of the most features, and to make sure your computer behaves the way ours does, you should install the latest version of each.


Note that while you don’t need GitHub to create a package, the workflow we are using today assumes you are using GitHub for version control and to host your package. Therefore we strongly encourage you to create a GitHub account and to start using GitHub version control alongside your package development.


Create a Package

There are multiple ways to build a package, and it just keeps getting easier thanks to RStudio. The steps below are the ones that most consistently have worked for me as of January 2022.

Creating a package with the file menu and Git Terminal (the old way):
  1. Create a repo on GitHub
    1. Log into your GitHub account in your web browser (if not already logged in).
    2. In the upper-right corner of the page, click on the “+” button and select “New repository”
    3. Name the repository and leave the rest as is and click on the “Create repository” button at the bottom of the page. Leave the next page open, as you’ll use the code example to connect your RStudio project to GitHub.
  2. Open RStudio and create a new project
    1. Go to File > New Project > New Directory > R Package
    2. Name the package, specify the directory, and check “Create git repo”. Note that special characters, like underscores, aren’t allowed in package names. I’m going to call this testpackage2.
    3. Click “Create Project”
  3. Go to the Terminal tab in your Console Pane (typically in the bottom left of the screen) and type in the code that is listed on GitHub under “…or create a new repository on the command line”, pressing Enter between each line. Note that you can also open the Git Shell by going to the Git tab in your Environment Pane, clicking on the down arrow to the right of the gear box, and clicking on Shell.

Once those steps are completed, check that it worked by going to the Git tab and pulling from GitHub. If the down and up arrows are grayed out, something went wrong. If they look like the image below, and you can pull down from GitHub, then you’re all set.

Example git code for setting up the repo on my account:


Creating a package in code (the easy way):

The easiest way to create a new package in RStudio is using the usethis package. You’ll first need to have a GitHub account and have RStudio connected to your GitHub account. Once that’s working, you can run the code below in your console.

  1. Create a new package. The code below should create a new package called testpackage and open a project in a new R session.
     usethis::create_package("D:/NETN/R_Dev/testpackage") # update to work with your file path
  2. Set up the local git repository. After running the code below, you’ll be prompted whether it is okay to commit the files created by the package template. Select 1 (Absolutely). Then you’ll be prompted to restart RStudio. Select 2 to restart.
     usethis::use_git() # sets up local git for new package
  3. Create the GitHub repo. After running the code below, you’ll be prompted whether it is okay to commit the files created by the package template to GitHub. Select 3 (Definitely). This will establish the connection to GitHub. If you don’t select 3, you’ll have to establish the connection manually via the Terminal.
     usethis::use_github() # creates new github repo called testpackage
  4. Add the license and change the branch name. After you run the code below, your browser will open to your GitHub repo (assuming you have the same settings as my computer). You’ll be prompted to click “OK” to change the branch from master to main. After you complete that step, you should see the green and blue arrows in your RStudio Git pane get activated.
     usethis::use_mit_license() # set license to MIT license (or use a different license)
     usethis::git_default_branch_rename() # renames master to main
And you’re done! Obviously this is much easier than working through the git shell. This is the approach we recommend, and this is the package we’ll continue to build on through the training. We’ll talk about the license piece later in the training. Now that we have a new package created, we’ll talk about the pieces that make up a package and look through the default files included in the package.


Package Anatomy

The basic building blocks of an R package are defined in the list below. At a bare minimum, there are 2 files and 2 folders that make up an R package. The 2 required files are DESCRIPTION and NAMESPACE. The two folders are “R” and “man”. Several additional files that improve documentation and git workflow are also added by RStudio’s package template. Your Files pane should look something like this:

RStudio

Package files explained
  • .gitignore: add files you don’t want to be pushed to your GitHub repo. I often put my data folder on this list, so I don’t accidentally push my data there.
  • .Rbuildignore: you can add files here that you don’t want to be included in the build. I sometimes have temporary scripts that I’m testing with, but don’t want to be included in the package. Adding them here prevents them from being included in the package build.
  • DESCRIPTION: This is the overall documentation for the package, including authors, version, package description, and imports and suggests.
  • man folder: This is where the documentation for the functions is stored; these files also make up the help files for the package. By default, there’s a hello.Rd file there, which we’ll delete. These files are generated by the roxygen2 code that is written for each function (more on that later).
  • NAMESPACE: Documents the imports (i.e., package dependencies) and exports (i.e., your package functions) for your package. If you document the DESCRIPTION and function roxygen2 text correctly, this should be automatically populated for you. Occasionally something goes wrong and you have to modify the file manually.
  • R folder: This is where your functions live. By default, there’s a hello.R function in that folder, which we’ll delete.
  • README.md: Use this README file to document more about your package beyond the 1-2 sentence description in the DESCRIPTION file. GitHub will display this README under the code on your repo page. You can also edit this file in GitHub, and then pull the changes into your RStudio project.
  • testpackage.Rproj: This is the R project file, which has the same name as the package. This makes transferring the project to different directories or different computers easier.
  • tests folder: We didn’t add this, so you won’t see it in your files. However, if you build unit tests (highly recommended, but not done often enough), they’ll live here.


Build Tools

The last step before we get to start adding to our R package is to make sure the Build Tools are set up and functioning properly.

  1. Go to Tools > Project Options > Build Tools. Make sure the “Generate documentation with Roxygen” box is checked.
  2. If you’re not directed to the Roxygen Options window, click on the “Configure” button. Make sure that under “Use roxygen to generate:”, you at least have checks next to Rd files, Collate field, and NAMESPACE file. Under “Automatically roxygenize when running:”, make sure you have checks next to R CMD check, Source and binary package builds, and Install and Restart. See the graphic below as an example. If you plan to build vignettes and want them to be rendered each time you rebuild your package, you can check that box too.
  3. We’re going to change the roxygen text at the beginning of hello.R to make sure the build updates the hello.Rd in the man folder.
    1. First, delete the hello.Rd in the man folder. While you’re at it, delete the NAMESPACE file in the top-level folder (we’ll let the Build Tools generate it from scratch).
    2. Now, open hello.R, and delete all of the commented text.
    3. Add the following code at the top of the file, then save.
    4. #' @title hello
      #' @description test package for R training
      #' @export
  4. If you have Rtools installed and correctly created your package, you should see a Build tab in your Environment Pane for that project. Click on “Install and Restart” (or Ctrl + Shift + B). The hello.R file and Build output should look like the image below. If you open your NAMESPACE file, you should see export(hello) listed. That’s what @export does for you.


  5. Now run the code below to view the help file you just made for the hello function in hello.R

  6. ?testpackage::hello

If the text you added shows up in the help file, your Build tools are all set. If the Build exited with status 1, then something is wrong with the roxygen text in your hello.R file. Review the build results to see if you can find the line identified as failing. Also check that each of the lines you added is commented with both symbols (#'), and that the terms following the @ are spelled correctly and don’t have a space.

The last thing to do is delete the hello.R file to clean up the package. You don’t need to delete the hello.Rd file, as it will be deleted the next time you rebuild your package.



Package Development

Add and document the first function

Now we get to add to our package and make it useful! We’re going to add a simple function to the package that I use all the time for my workflow. It’s a function that takes the first 3 letters of a genus and species to create a species code. It saves me having to type out full species names when I’m filtering through a lot of data.

To follow along, go to File > New R Script (or key Ctrl + Shift + N) and copy the code below to the script.

#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#' @export

make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(data, -genus, -species)
  return(data2)
}

Note in the Roxygen code above, we added the title, description, and export like we did for hello.R. We added a few more arguments to the Roxygen2 text at the top, including imports, params, and return.

Imports

We also added 2 imports, which are dependencies of your R package. The first line imports mutate and select from the dplyr package. The second imports the word function from the stringr package. By adding these 2 lines to the roxygen, these functions become part of the namespace of the package (more on that later), and will be usable by any function in your package.

If you use all base R functions within the functions of your package, you don’t need to use imports. In general, best coding practice is to minimize the number of dependencies, which reduces the number of packages a user needs to install before using your package and makes it less likely that your package code will break because a dependency was updated. I use them here to show you the workflow when you need dependencies (e.g., it’s hard for me not to want dplyr at some point in a package). Another note is that @importFrom only adds the named functions from that package to the namespace, so you’re less likely to have conflicts with other packages. If you want to make the entire package available to the package namespace (e.g., I’ve done this with ggplot2), then you’d write: #' @import ggplot2

Parameters

Parameters are where you define the inputs to your function. If an input only takes certain arguments, like TRUE/FALSE, or a list of park codes, @param is how you document that to the user. Note that if your package functions share the same parameters, you can inherit parameters from other functions, instead of having to copy/paste them across functions by adding #' @inheritParams make_sppcode.
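For example, a hypothetical second function that takes the same data and sppname arguments could inherit make_sppcode’s parameter documentation rather than repeating it:

```r
#' @title count_sppcodes
#' @description Count the number of unique species codes (hypothetical example)
#'
#' @inheritParams make_sppcode
#'
#' @return Returns the number of unique species codes in the data frame.
#' @export
count_sppcodes <- function(data, sppname){
  # reuse make_sppcode() from this package, then count distinct codes
  data2 <- make_sppcode(data, sppname)
  length(unique(data2$sppcode))
}
```

When you rebuild, roxygen2 copies the @param entries for data and sppname from make_sppcode into the help file for count_sppcodes.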

Return

The @return argument tells the user what to expect as the output of the function.

Export

This @export argument tells R to export that function into the NAMESPACE file.


Update DESCRIPTION file

There’s one last piece of documentation we need to complete before dependencies will be installed along with your package. Open the DESCRIPTION file. It should look like:

You’ll want to update the Title, Author, Maintainer, and Description, which are pretty self-explanatory. As you update your package, you’ll also want to update the Version number. Next we need to add the Imports and Suggests to the DESCRIPTION, which are defined below.
Imports: Packages listed under Imports will be installed at the same time your package is installed. You can also set a minimum version number; if a user’s installed version is older, the newer version will be installed along with your package.
Suggests: These packages are not installed at the time your package is installed. Suggests are helpful for external packages that are only used by one or a few functions in your package. For example, one of our packages has a function that imports data directly from our SQL Server, but only a few network staff can access the server. The external packages that the SQL import function uses are listed under Suggests. The SQL import function then checks to see if the suggested packages are installed on the user’s computer. If not, it will stop and print an error that it needs to be installed. We’ll show that workflow later.

You can either manually add these to the DESCRIPTION file like:
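Based on the imports and suggests for testpackage, the relevant DESCRIPTION entries would look something like:

```
Imports:
    dplyr,
    stringr
Suggests:
    ggplot2
```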

OR, you can use the usethis package to do the heavy lifting!

usethis::use_package("dplyr") # for imports which is the default
usethis::use_package("stringr") # for imports which is the default
usethis::use_package("ggplot2", "Suggests") # for suggests
Note also that the License should be MIT + file LICENSE, if you followed the usethis workflow we showed earlier to create the package. I don’t know a lot about licenses, other than it’s best practice to set one. The MIT license is the most common passive license that means your code is open source and allows anyone to copy code with minimal restrictions. If you want all derivatives of your code to be open source, the GPLv3 license is the most common license (usethis::use_gpl_license()).


Rebuild and Document package

We’re finally ready to document the package (note you could have done it after each step). Go to the Build tab and click “Install and Restart” (or Ctrl + Shift + B). Assuming the roxygen and DESCRIPTION were written correctly, you should now see a make_sppcode.Rd in the man folder. You can also check that help works for the function:

?testpackage::make_sppcode


Check your NAMESPACE

Open your NAMESPACE file. It should look like this:
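With make_sppcode() exported and the dplyr and stringr functions imported, the NAMESPACE entries should look something like:

```
# Generated by roxygen2: do not edit by hand

export(make_sppcode)
importFrom(dplyr,mutate)
importFrom(dplyr,select)
importFrom(stringr,word)
```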

The namespace should contain all of the functions you’ve built for your package as exports, along with all of the external dependencies you are using within your functions as imports. As you add more functions and dependencies, they are added here each time you rebuild your package. You can also store data in the namespace, which can then be accessed by your package functions.

The concept of a namespace is a special beast, and can be a bit hard to wrap your head around. In a nutshell, each package has its own environment that contains all the package’s functions, dependencies, and objects (e.g., data) that have been defined for that package. This environment is separate from your global environment. When you load a package in your session, the package’s environment is accessible, but only through its functions. For example, dplyr is a dependency of our testpackage. When we load testpackage (e.g., library(testpackage)), the testpackage functions can use dplyr. However, if we need dplyr outside of testpackage functions, we have to load it first.


Test your package

Now that the documentation is all set, let’s test that the make_sppcode() function actually works! Try running the code below to see if it works.

library(testpackage)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa", 
                                         "Malaxis unifolia", "Calopogon tuberosus"), 
                          cover = c(10, 40, 10, 50),
                          stems = c(50, 20, 10, 10))

example_dat2 <- make_sppcode(example_dat, "Latin_Name")
head(example_dat2)
##            Latin_Name cover stems sppcode
## 1        Carex limosa    10    50  CARLIM
## 2    Arethusa bulbosa    40    20  AREBUL
## 3    Malaxis unifolia    10    10  MALUNI
## 4 Calopogon tuberosus    50    10  CALTUB


Add Examples

While examples are not required, they are by far the best way to help users understand how to use your functions. They’re also breadcrumbs for future you as a reminder of how a function works. Examples work best when you first create a simple fake data set to run with the function. That way, a user can easily reproduce and run the code on their machine. We just created the example we’re going to add in the process of testing the function. The code chunk below shows how to add it. Note that if you want to show an example that takes a long time to run, and you don’t want it to run while building or checking the package, you can wrap it in \dontrun{ example code here }.

#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#'
#' @examples
#' library(testpackage)
#' 
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa", 
#'                                          "Malaxis unifolia", "Calopogon tuberosus"),
#'                           cover = c(10, 40, 10, 50),
#'                           stems = c(50, 20, 10, 10))
#'
#' example_dat2 <- make_sppcode(example_dat, "Latin_Name")
#' head(example_dat2)
#'
#' @export

make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(data, -genus, -species)
  return(data2)
}

Run CMD Check

The last thing you need to do before posting your package to GitHub for others to use is to run the R CMD check. You can do this 3 ways: click on the Check button in the Build tab, press Ctrl + Shift + E, or run devtools::check().

The R CMD check runs through your package to ensure all required and suggested files exist, identifies potential typos/errors in the roxygen2 code for each function, lets you know about imports you may have forgotten to add to your NAMESPACE, etc. It’s best practice to run this check before sharing/posting your package for others.

Error Handling

Thoughtful and thorough error handling makes your package user friendly. Coding best practice is to have as many checks at the beginning of your function as possible to catch common (or even uncommon) issues that are likely to happen, and to have a clear error or warning message for each check. This is often referred to as “failing early”. That way, the user won’t be waiting for code to run, only for it to fail a few minutes later with a vague or misleading error message. If you don’t have error handling in your function, the error message often comes from the first external function that failed and that has built-in error handling, rather than from the line of code where your function actually failed.

Checking for Suggests

We’re going to add one more function to our testpackage, so we can talk about suggests (i.e. suggested packages that aren’t automatically installed when your package is installed) and error handling. The function will be called theme_IMD and will specify a custom theme for ggplot2, which is one of our suggests. Open a new script, name it theme_IMD, and copy the code chunk into it.
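The original code chunk isn’t reproduced here, so below is a minimal sketch of what theme_IMD could look like. The specific theme settings are illustrative - the key pattern is the requireNamespace() check at the top:

```r
#' @title theme_IMD
#' @description Custom ggplot2 theme (illustrative settings)
#' @return A ggplot2 theme object
#' @export
theme_IMD <- function(){
  # Fail early: ggplot2 is a suggested package, so check that it's installed
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("Package 'ggplot2' needed for this function to work. Please install it.",
         call. = FALSE)
  }
  # Suggested packages must be called with the package:: approach
  ggplot2::theme(panel.background = ggplot2::element_blank(),
                 panel.grid.minor = ggplot2::element_blank(),
                 panel.border = ggplot2::element_rect(color = "#696969", fill = NA),
                 axis.line = ggplot2::element_line(color = "#696969"))
}
```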

Notice the line that checks whether ggplot2 is installed on the user’s machine. If ggplot2 isn’t installed, the function will fail immediately and will print “Package ‘ggplot2’ needed for this function to work. Please install it.” in the console. This check only happens in functions that include this code. Note that for suggests, you also have to use the package:: approach to specify their functions. This is always a messy business for ggplot2…

Additional checks

There are tons of possible checks you can do. I often peek under the hood of packages that are well designed to see what types of checks the pros actually use. You can view the code under the hood of a function by placing your cursor on the function name and pressing the F2 key.

Some examples of checks I commonly use are match.arg(), which makes sure arguments match what the function allows and what the user specified. The stopifnot(class(argument) == 'classtype') check is helpful to ensure numeric or logical arguments are specified properly. Other checks I often include make sure the data sets that the function uses exist in the global environment. The code below is an excerpt from a function in our forestNETN package that compiles tree data. The first part of the code checks the arguments specified by the user. The tryCatch() looks for the COMN_TreesByEvent object. If it exists, it is assigned to tree_vw. If it doesn’t, the function exits and prints the error quoted in stop() to the console.

joinTreeData <- function(park = 'all', from = 2006, to = 2021, QAQC = FALSE, locType = c('VS', 'all'), panels = 1:4,
                         status = c('all', 'active', 'live', 'dead'), 
                         speciesType = c('all', 'native','exotic', 'invasive'),
                         canopyPosition = c("all", "canopy"), dist_m = NA, 
                         eventType = c('complete', 'all'), output = 'short', ...){

  # Match args and classes
  status <- match.arg(status)
  park <- match.arg(park, several.ok = TRUE,
                    c("all", "ACAD", "MABI", "MIMA", "MORR", "ROVA", "SAGA", "SARA", "WEFA"))
  stopifnot(class(from) == "numeric", from >= 2006)
  stopifnot(class(to) == "numeric", to >= 2006)
  locType <- match.arg(locType)
  stopifnot(class(QAQC) == 'logical')
  stopifnot(panels %in% c(1, 2, 3, 4))
  output <- match.arg(output, c("short", "verbose"))
  canopyPosition <- match.arg(canopyPosition)
  speciesType <- match.arg(speciesType)
  
  # Check for tree data in global environment
  tryCatch(tree_vw <- COMN_TreesByEvent,
           error = function(e){stop("COMN_TreesByEvent view not found. Please import view.")}
  )

}

Error handling could take an entire day or more to cover fully, but that’s about all we have time for today. For more detail, Chapter 8 Conditions in the Advanced R book covers this topic quite thoroughly. Another useful resource is Chapter 2.5 Error Handling and Generation in Mastering Software Development in R.


Debugging

Debugging is another big topic that we only have time to scratch the surface of. For further reading, the best resource I’ve found on debugging is Chapter 22 Debugging in the Advanced R book.

Low-tech debugging

The simplest form of debugging is to load the dependencies and define objects in your global environment that will feed into the function, and then to run the code in the function under the hood. A simple example with the make_sppcode function is in the code chunk below. Note that I commented out the lines that start and end the function.

# dependencies
library(stringr)
library(dplyr)

#function args
data <- example_dat
sppname <- "Latin_Name"

#make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(dat, -genus, -species)
## Error in select(dat, -genus, -species): object 'dat' not found
#  return(data2)
#}
In the example above, we found that the line with select(dat, -genus, -species) had a typo: dat should have been data.


Using traceback()

There are several other built-in R functions that can help with debugging. The two I use the most often are traceback() and debug(). To show how traceback() works, let’s create a function that we know has an error. Copy this code to your R session and run it. You should see make_sppcode_error show up in your global environment after you run it.

make_sppcode_error <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(dat, -genus, -species)
  return(data2)
}

Now try to use the function:

make_sppcode_error(example_dat, sppname = "Latin_Name")
## Error in select(dat, -genus, -species): object 'dat' not found

It should fail, and the error message tells you that object ‘dat’ was not found. You could then go look under the hood in your function to try to find where dat lives. Or, you can use traceback(), which shows you the code and line number that failed. If this function calls other functions from your package, and one of those is what failed, traceback() will tell you that too. Run the code below to see for yourself.

traceback()


The Magic of debug()

The debug() function allows you to look under the hood of a function and steps through the function code one line at a time. You can see the outputs of each line, and even interact with/change them to test how the function behaves. Once you get the hang of it, you’ll never go back to the low-tech debugging approach I described first.

The code chunk below shows how to start using the debug() function to step through the make_sppcode_error() function. It’s hard to show with R Markdown, but we’ll demo how to walk through the browser in debug() in a minute. Once you run the code below, your console will show a message that starts with debugging in: make_sppcode_error(data, "Latin_Name"). You’ll also see a Browse[2]> prompt below it, where you can enter one of several options:

  • n to execute the next line of code
  • s to step into the function calls in the next line of code. This means that if there’s a function in that next line, it will open a browser to debug that function. This is really helpful when you have another package function within that function that’s failing, because you’re able to interact with it in the exact scenario where it’s failing.
  • c to continue to the end of the function and exit the browser
  • f to run through to the end of the function but keep the browser open
  • Q to exit the browser and stop debugging.
debug(make_sppcode_error)
make_sppcode_error(data, "Latin_Name")

In our case, we’ll enter n and step our way through the function, printing the head(data) to make sure it looks the way we expect. Eventually we’ll find that the function fails on the select(dat, ...) line. Then we’ll exit out by pressing Q or c.
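One gotcha worth knowing: debug() stays attached to the function until you turn it off, so the browser will open on every call. To debug a single call instead, debugonce() is handy:

```r
# Remove the debugging flag so the browser stops opening
undebug(make_sppcode_error)

# Or flag the function for a single debugging session
debugonce(make_sppcode_error)
make_sppcode_error(example_dat, "Latin_Name") # opens the browser this time only
```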



Share Package on GitHub

Post to GitHub

We’re finally ready to post our package to GitHub. You’ll want to run the build (Ctrl + Shift + B or “Install and Restart” in Build tab) and check (Ctrl + Shift + E or “Check” in Build tab) one last time. If all checks out, then go to the Git tab in your environment pane. Click on the Commit button, which will open a pane like below. Go ahead and check all but the gitignore files, which stages the changes, add a commit message, and then press Commit. The last step is pushing to GitHub using the green up arrow in the top right. After you push to GitHub, you should see the new files/changes in the main branch of your GitHub repo in your browser. You might need to refresh the page (F5).

Install from GitHub

The package is currently only installed on your local machine. The easiest way for others to use your package is for them to install it using devtools. This will install the package based on the code in the main branch. Packages can also be installed from a branch by adding ref = 'branchname' to the function call.
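As a sketch (the "username/mypackage" repo name here is a placeholder; swap in your own GitHub user and repository names):

```r
# Install the package from the main branch on GitHub
devtools::install_github("username/mypackage")

# Or install from a development branch
devtools::install_github("username/mypackage", ref = "dev")
```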

The other way to install the package is for the user to clone the GitHub repository to their local machine and then use the build tools to build the package themselves. This is the preferred method only if other users of your package are also interested in contributing to it. Be sure to use branches to keep the main branch protected, so that install_github() continues to work while you’re developing your package.
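If you go the clone-and-build route, the local install step can be sketched like this (the path below is a placeholder for wherever the repo was cloned):

```r
# After cloning the repo (e.g., with git clone or GitHub Desktop),
# build and install the package from the local folder:
devtools::install("C:/path/to/cloned/repo")
```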

Resources

  • R Packages book: The 2nd Edition is currently under development, with lots of updates to package development workflow (e.g. the usethis package!) and improved examples being added frequently. You can’t go wrong when Hadley Wickham and Jenny Bryan team up on a project.
  • Advanced R is an excellent resource to learn the more advanced skills and concepts with R programming. The book is pretty advanced and assumes you already have a solid foundation in R programming.
  • Mastering Software Development in R covers a lot of the same topics as Advanced R that we cover in this training, but is a gentler introduction and doesn’t get quite as advanced.


Resources

General Resources

  • NPS_IMD_Data_Science_and_Visualization > Community of Practice is an IMD work group that meets once a month to talk about R and Data Science. There are also notes, materials and recordings from previous meetings, a Wiki with helpful tips, and the chat is a great place to post questions or cool tips you’ve come across.
  • R for Data Science First author is Hadley Wickham, the programmer behind the tidyverse. There’s a lot of good stuff in here, including a chapter on using R Markdown, which is what we used to generate this training website.
  • Mastering Software Development in R First author is Roger Peng, a Biostatistics professor at Johns Hopkins, who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. His intro to ggplot is great. He also covers a lot of more advanced topics in this book, like making functions and packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.

Resources by Day:

Day 1: Data Retrieval

Resources

  • rFIA website which contains background on the package, along with tutorials for using the data.
Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • RStudio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The post I linked to helped me figure out how to map shapefiles in ggplot.


Day 3: R Markdown

R Markdown
  • R Markdown: The Definitive Guide. This is Yihui Xie’s book that’s available free online and for purchase. Yihui is one of the main developers at RStudio working on R Markdown and related packages (e.g. knitr, pagedown, bookdown). The online book is searchable, and is often the first place I check when I can’t remember how to do something.
  • R Markdown Cookbook. Another great book with Yihui Xie as an author. This book is free online and has a lot of small bite-size tips for customizing R Markdown.
  • R for Data Science, chapters 27 and 29. The book itself is worth a read cover to cover. The chapters on R Markdown are also very helpful on their own.
  • RStudio’s R Markdown page: Includes several short videos and lots of tutorials, articles, and a gallery to give an idea of the many things you can do.
  • R Markdown Cheat Sheet.
  • kableExtra vignette: Lots of great examples of the different styling/formats you can use with kables and the kableExtra package.
  • R Markdown for Scientists is an online book with a lot of helpful tips for people like us.
HTML/CSS
  • W3 Schools HTML page: This website includes helpful tutorials on HTML and CSS, includes executable examples for just about every HTML tag you can think of, and shows how different browsers render content.
  • HTML & CSS design and build websites: This book costs about $15 (cheaper if you can find a used copy) and was a very helpful introduction and continual resource, particularly for working with CSS. There’s a Javascript & JQuery book by the same author that’s equally well-done, but was a much steeper learning curve.
LaTeX
  • CTAN.org: This is LaTeX’s version of CRAN.
  • Overleaf Online LaTex Editor: Includes a short guide to learn LaTeX, examples for most common uses of LaTeX, and has a built in editor that you can execute code in.
MS Office

Day 4: R Packages

  • R Packages book: The 2nd Edition is currently under development, with lots of updates to package development workflow (e.g. the usethis package!) and improved examples being added frequently. You can’t go wrong when Hadley Wickham and Jenny Bryan team up on a project.
  • Advanced R is an excellent resource to learn the more advanced skills and concepts with R programming. The book is pretty advanced and assumes you already have a solid foundation in R programming.
  • Mastering Software Development in R covers a lot of the same topics as Advanced R that we cover in this training, but is a gentler introduction and doesn’t get quite as advanced.

Code printout

This tab prints all of the code chunks in this file in one long file.

knitr::opts_chunk$set(warning=FALSE, message=FALSE)
rm(list = ls())


rm(list = ls())
#--------------------
#       Prep
#--------------------
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", # Reading Databases
              "GADMTools", "sf", "tmap", # GIS in R
              "dataRetrieval", "lubridate", "jsonlite", "httr", # Aquarius
              "rFIA" # for USFS FIA data
              )

install.packages(setdiff(packages, rownames(installed.packages())))  
install.packages('terra', repos='https://rspatial.r-universe.dev')
install.packages('tmap')
packages <- c("tidyverse", "DBI", "odbc", "readr", "dbplyr", "GADMTools", "sf", "tmap", 
              "dataRetrieval", "lubridate", "jsonlite", "httr", "rFIA")

installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 
packages <- c("tidyverse", "parallel", "microbenchmark")

install.packages(setdiff(packages, rownames(installed.packages())))  

packages <- c("tidyverse", "parallel", "microbenchmark")

installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 
packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT")

install.packages(setdiff(packages, rownames(installed.packages())))  

packages <- c("rmarkdown", "tidyverse", "knitr", "kableExtra", "DT")

installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 
devtools::install_github("haozhu233/kableExtra")
packages <- c("devtools", "usethis", "roxygen2","stringr", "dplyr")

install.packages(setdiff(packages, rownames(installed.packages())))  

packages <- c("devtools", "usethis", "roxygen2", "stringr", "dplyr")

installed_packages <- packages %in% installed.packages() # check which packages are installed
if (length(packages[!installed_packages]) > 0){
  install.packages(packages[!installed_packages], dep = TRUE)} # if some are missing, install them

lapply(packages, library, character.only = TRUE) 
#----------Connect to Access------------#

db_path <- "path/to/database.mdb"
conn_string <- paste0("Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=", db_path)

conn <- DBI::dbConnect(odbc::odbc(), .connection_string = conn_string)


#---------------Read the data---------------#

data <- dplyr::tbl(conn, "tbl_MyAccessTable") %>%
  collect()

# Common tidying operations:
data <- mutate_if(data, is.character, trimws) %>%
  mutate_if(is.character, dplyr::na_if, "")

# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
library(DBI)
library(odbc)
library(readr)
library(magrittr)
library(dplyr)
library(dbplyr)

#----------Option 1: Hard code db connection info (not great)------------#

sql_driver <- "SQL Server Native Client 11.0"
my_server <- "SERVER\\NAME"
my_database <- "Database_Name"

conn <- dbConnect(Driver = sql_driver, Server = my_server, Database = my_database, Trusted_Connection = "Yes", drv = odbc::odbc())


#----------Option 2: Read db connection info from shared drive------------#

# This is just a csv with columns for Driver, Server, and Database
params <- readr::read_csv("path/to/csv/on/shared/drive/stlk-database-conn.csv")
params$Trusted_Connection <- "Yes"
params$drv <- odbc::odbc()

conn <- do.call(dbConnect, params)


#----------Option 3: Create a user DSN (easy to use in R, but has to be configured for each user on each computer)------------#

# Create DSN: https://docs.microsoft.com/en-us/sql/relational-databases/native-client-odbc-how-to/configuring-the-sql-server-odbc-driver-add-a-data-source?view=sql-server-ver15
dsn_name <- "myDSN"
conn <- dbConnect(odbc(), dsn_name)


#---------------Read the data---------------#

data <- dplyr::tbl(conn, dbplyr::in_schema("analysis", "Chemistry")) %>%  # The first argument to in_schema is the schema name (SQL default is dbo) and the second is the table name.
  collect()

# Common tidying operations:
  data <- mutate_if(data, is.character, trimws) %>%
    mutate_if(is.character, dplyr::na_if, "")

# Close the connection as soon as you no longer need it, otherwise you will get weird errors
dbDisconnect(conn)
library(GADMTools) # for downloading spatial data
library(sf) # for working with simple features
library(tmap) # for mapping spatial data
install.packages('terra', repos='https://rspatial.r-universe.dev')
library(GADMTools)

# Level 0: National
gadm_sf_loadCountries("USA", level = 0, basefile = "./data/")
us_bound <- readRDS("./data/USA_adm0.sf.rds")
us_bound_utm <- st_transform(st_as_sf(us_bound, crs = 4326), 5070)
st_write(us_bound_utm, "./data/US_Boundary.shp", append = F)

# Level 1: State/Province
gadm_sf_loadCountries("USA", level = 1, basefile = "./data/")
us_states <- readRDS("./data/USA_adm1.sf.rds")
us_state_utm <- st_transform(st_as_sf(us_states, crs = 4326), 5070)
st_write(us_state_utm, "./data/US_States.shp", append = F)

# Level 2: County/district
gadm_sf_loadCountries("USA", level = 2, basefile = "./data/")
us_cnty <- readRDS("./data/USA_adm2.sf.rds")
us_cnty_utm <- st_transform(st_as_sf(us_cnty, crs = 4326), 5070)
st_write(us_cnty_utm, "./data/US_Counties.shp", append = F)
library(GADMTools)
library(sf)

# Level 0: National
us_bound_utm <- st_read("./data/US_Boundary.shp")

# Level 1: State/Province
us_state_utm <- st_read("./data/US_States.shp")

# Level 2: County/district
us_cnty_utm <- st_read("./data/US_Counties.shp")
plot(us_state_utm[1])
# Load park tiles
NPSbasic = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSimagery = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSslate = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
  
NPSlight = 'https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg'
# Map 
mn_cnty <- st_as_sf(us_cnty_utm[us_cnty_utm$NAME_1 == "Minnesota",])

mn_map <- tm_shape(mn_cnty, projection = 5070) +
          tm_borders(col = "black")

tmap_options(basemaps = c(Map = NPSbasic,
                          Imagery = NPSimagery,
                          Light = NPSlight,
                          Slate = NPSslate))
tmap_mode('view') # set to interactive mode
mn_map
tmap_mode('plot') # return to static mode
library(purrr)
# Create function to iterate download
downFIA <- function(state, table){
  csv_name <- paste(state, table, sep = "_")
  csv_link <- paste0('http://apps.fs.usda.gov/fia/datamart/CSV/', csv_name, '.csv')
  dat <- read.csv(csv_link) # read the csv into a data frame
  write.csv(dat, paste0("./data/", csv_name, '.csv')) # write the data, not the name string
}

# Create list of states of interest
states = c("CT", "RI", "DE")

# Use purrr::map to download each state's PLOT table.
map(states, ~downFIA(., "PLOT"))

library(rFIA)

# Download CT and RI state data tables to data folder
getFIA(states = c('DE', 'RI'), dir = "./data")

# Download reference tables (ie lookup tables)
getFIA(states = 'ref', tables = c('SPECIES', 'PLANT_DICTIONARY'))
library(rFIA)
# Import all states in R that's in the data folder
fia_all <- readFIA("./data")
names(fia_all) # Print table names
table(fia_all$PLOT$STATECD) # View number of plots by state code. Should be 2.
table(fia_all$PLOT$INVYR) # Range of inventory years in the data

# Import RI data only
fia_ri <- readFIA("./data", states = 'RI')
names(fia_ri)
head(fia_ri$PLOT)
table(fia_ri$PLOT$STATECD) # View number of plots by state code. Now only 1.
table(fia_ri$PLOT$INVYR)

# Clip all data to most recent sample
all_recent <- clipFIA(fia_all, mostRecent = TRUE)

# Print list of counties in RI
countiesRI 

# Clip RI to Washington County
ri_Wash <- clipFIA(fia_ri, mask = countiesRI[5,], mostRecent = F)
fia_ri_rec <- clipFIA(fia_ri, mostRecent = TRUE)

# State level tree population estimates
tpa_ri <- tpa(fia_ri)

# Plot level tree population estimates for most recent survey
tpa_ri_plot <- tpa(fia_ri_rec, byPlot = TRUE)

# Species level and size class level tree population estimates
tpa_ri_sppsz <- tpa(fia_ri_rec, bySpecies = TRUE, bySizeClass = TRUE)

# by Ownership
tpa_ri_own <- tpa(fia_ri_rec, grpBy = OWNGRPCD)
head(tpa_ri)
plotFIA(tpa_ri, y = TPA, plot.title = "Trees per acre")

#the following toolbox can be pulled directly from my github:
#source("Aquarius basics.R") or....
source("https://raw.githubusercontent.com/AndrewBirchHydro/albAquariusTools/main/Aquarius%20basics.R")
timeseries$connect("https://aquarius.nps.gov/aquarius", "aqreadonly", "aqreadonly")
publishapiurl='https://aquarius.nps.gov/aquarius/Publish/v2'

#we also will need to load a few packages. If you do not have these, you will need to install them in the console.
library(dataRetrieval)
library(lubridate)
library(ggplot2)
library(jsonlite)
library(httr)
#assign the network name of interest
Network <- "Southern Colorado Plateau Network"

#this function will print the available locations in that network
Print_Locations(Network = Network)

#if you want a dataframe of the locations, you can assign it to 'Locations_in_network' here:
Locations_in_Network <- as.data.frame(Print_Locations(Network = Network))


#enter the location of interest to get a list of datasets:
Print_datasets(Location = "BANDRIT01")

#you can also print available metadata for a given location based on its identifier:
Print_metadata(Location = "BANDRIT01")

#this line will pull the dataset from Aquarius
raw_record <- Get_timeseries2(record = "Water Temp.AquaticLogger@BANDRIT01")

#this function will "clean it up"
temp_data <- TS_simplify(data = raw_record)
# some summary information
head(temp_data)
summary(temp_data)

# a quick plot
ggplot(temp_data, aes(x = date_time, y = value)) +
  geom_path() +
  theme_bw() +
  xlab("Time") +
  ylab("Water Temperature (C)")

#00060- USGS code for discharge

siteNo <- "08313350"
parameter <- "00060"

#for start and end dates, you can enter basically all of known time to get whatever data is available, or narrow it down to a particular window. In this instance, let's go from 1900 to 2025 to ensure we pull everything available
start.date <- "1900-01-01"
end.date <- "2025-01-01"


NWIS_query <- readNWISuv(siteNumbers = siteNo,
                     parameterCd = parameter,
                     startDate = start.date,
                     endDate = end.date)


colnames(NWIS_query) <- c("agency", "identifier", "date_time", "Q_cfs", "time_zone")

head(NWIS_query)
summary(NWIS_query)

# we'll plot both the water temperature and discharge records for a quick visual comparison
ggplot(data = NWIS_query, aes(x = date_time, y = Q_cfs)) +
  geom_path() +
  theme_bw() +
  scale_y_log10() +
  xlab("Time") +
  ylab("Discharge (cfs)")

ggplot(temp_data, aes(x = date_time, y = value)) +
  geom_path() +
  theme_bw() +
  xlab("Time") +
  ylab("Water Temperature (C)")


# we will enter FALSE for all.x and all.y, as we only want the period where the records overlap one another:
combined_record <- merge(x = NWIS_query, y = temp_data, 
                         by = "date_time", all.x = F, all.y = F)

head(combined_record)
summary(combined_record)

# let's take a basic look at the time series together. Note that since they are on different scales, a secondary axis is required
ggplot(combined_record, aes(x = date_time, y = Q_cfs)) +
  geom_path(color = "blue") +
  geom_path(color = "red", aes(x = date_time, y = value * 500)) +
  theme_bw() +
  scale_y_continuous(name = "Discharge (cfs)", sec.axis = sec_axis(~ . / 500, name = "Water Temperature (C)"))

# let's plot temperature and discharge against one another to see their relationship:
ggplot(combined_record, aes(x = Q_cfs, y = value)) +
  geom_point() +
  theme_bw() +
  xlab("Discharge (cfs)") +
  ylab("Water Temperature (C)") +
  scale_x_log10()
# This will require the lubridate package
library(lubridate)

# this will create a new column with the numeric month derived from the date_time column as a factor
combined_record$month <- as.factor(month(combined_record$date_time))
# this will create a new column with the year derived from the date_time column as a factor
combined_record$year <- as.factor(year(combined_record$date_time))


ggplot(combined_record, aes(x = Q_cfs, y = value, color = month)) +
  geom_point() +
  theme_bw() +
  xlab("Discharge (cfs)") +
  ylab("Water Temperature (C)") +
  scale_x_log10() +
  ggtitle("2013") +
  facet_grid(year ~ .)
rm(list = ls())
#run these in the console without the () to see what lies underneath
mean

lm

set.seed(12345) #gives everybody the same data

d<-c(floor(runif(100)*100),NA) #generate random data

mean(x=d) #unexpected result
mean2<-  #Tell [R] that I want this new function to be named "mean2"
       function(x){  #the function consists of 1 parameter named x (aka the data) The { begins the function source code / expressions. 
                   mean(x,na.rm=T) #in the mean function change the default for na.rm=T
                   } #close function

mean2(x=d) #more expected result

mean2(x=d, na.rm=F)

mean3<- function(x,na.rm=T){mean(x=x, na.rm=na.rm)} 

mean3(d)
mean4<- function(x,na.rm){#very minor change. I deleted the initial parameter value
                          mean(x=x, na.rm=na.rm)} 
mean4(d)
mean5<- function(x,na.rm){mean(x=x, na.rm=na.rm)} #always works

mean5<- function(x,na.rm) mean(x=x, na.rm=na.rm) #only works on one line

library(ggplot2);library(magrittr)
#get that data
fNames<-c("APIS01_20548905_2021_temp.csv",
          "APIS02_20549198_2021_temp.csv",
          "APIS03_20557246_2021_temp.csv",
          "APIS04_20597702_2021_temp.csv",
          "APIS05_20597703_2021_temp.csv")

fPaths<-paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames) 

HoboList<-lapply(fPaths, FUN=read.csv, skip=1, header=T)%>% #1. read hobo data into a list
          lapply(., "[",,1:4)%>% #2. Grab only first 4 columns. Empty comma is not an error
          lapply(., setNames, c("idx","DateTime","T_F","Lum"))%>% #3. set col names
          lapply(., dplyr::mutate, DateTime2=as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz="UCT"))%>%#4. format datetime in new variable. 
          setNames(., fNames) #5. name each one for tracking
hobo_summary <- function(x, col=3) {   
   funs <- c(mean, median, sd, mad, IQR) #list of functions
             unlist(                     #unlist simplifies somewhat ugly output
             y<-lapply(funs, function(f){f(x[,col], na.rm = TRUE)})%>%
                setNames(.,c("mean", "median", "sd", "mad", "IQR"))
             )
} #credit to [5] for this general example idea

summarized_data<-lapply(HoboList, FUN=hobo_summary)%>%
                        do.call("rbind", . )%>%   
                        as.data.frame( . ) 
ggplotCustom<-function(i, j, pattern=".csv", replacement="_plot.pdf", path=choose.dir(), device="pdf", height=5, width=5,units="in"){
              p<-ggplot(data = j[[i]], aes(x=DateTime2, y=T_F))+
                 geom_point()+
                 ggtitle(names(j)[i])
              ggsave(filename=gsub(pattern=pattern, replacement = replacement, names(j)[i]),
                     path=path, plot=p,  device=device, 
                     height=height, width=width, units=units)
            }

#do some quick testing
ggplotCustom(i=1, j=HoboList, path= "C:/Users/tparr/Downloads/Training_Output/") #test to see if function is working for a positive case

ggplotCustom(i=3, j=HoboList, path= "C:/Users/tparr/Downloads/Training_Output/") #test to see how it behaves on a negative case

#now iterate
lapply(seq_along(HoboList), FUN=ggplotCustom, HoboList, path="C:/Users/tparr/Downloads/Training_Output/")


lm_ls<-function(data,x){mod<-lm(x, data=data); return(mod)}
modlist<-lapply(HoboList[c(2,4,5)], lm_ls, T_F~Lum) #I know some are missing light data
lapply(modlist, summary)
lapply(modlist, plot)
lapply(modlist, coef)
lm_stats<-function(mod){ mod_sum<-summary(mod) #may or may not be worthwhile
                         out<-data.frame(
                         intercept= coef(mod)[[1]],
                         slope= coef(mod)[[2]],
                         slp_pval=mod_sum$coefficients[,4][[2]], #see what happens if you run this without the [[2]]
                         R2_adj= mod_sum$adj.r.squared,
                         mod_pval= mod_sum$fstatistic %>% {unname(pf(.[1],.[2],.[3],lower.tail=F))})
                        return(out) 
}

m<-lapply(modlist, lm_stats)%>%
         do.call(rbind,.)%>%
         dplyr::mutate(.,id=rownames(.))%>%
         magrittr::set_rownames(.,1:nrow(.))

HoboList2<-c(rep(HoboList,5)) #make the dataset larger 

library(future)        # for plan()
library(future.apply)  # for future_lapply()
plan("multisession", workers=parallel::detectCores()-1) #initiate a multicore session; use 1 fewer worker than the max detected cores to reduce the chance of overwhelming the system.
microbenchmark::microbenchmark(
"sequential"={lapply(seq_along(HoboList2), FUN=ggplotCustom, HoboList2, path="C:/Users/tparr/Downloads/Training_Output/")},
"parallel"={future_lapply(seq_along(HoboList2), FUN=ggplotCustom, HoboList2, path="C:/Users/tparr/Downloads/Training_Output/")},
times=5,
unit="s"
)
plan("sequential") #close the multicore session.


library(tidyverse)
library(lubridate)
library(modelr)


## First statement makes i the "index" variable; the loop body will run once for each i
x <- NA
for (i in 1:100) {

  ## get 10000 random numbers from a Poisson distribution with a mean i  (changes)
  Samples <- rpois(10000, i)

  # get the 95% upper quantile - e.g. the number that is higher than 95% of the other samples
  Q95 <- quantile(Samples, 0.95)

  ## save that to vector "x" using [i] to make sure it goes to the right location
  x[i] <- Q95
}

# plot our results

plot(x)


rm(i, Q95, x, Samples)


#file names
fNames <- c(
  "APIS01_20548905_2021_temp.csv",
  "APIS02_20549198_2021_temp.csv",
  "APIS03_20557246_2021_temp.csv",
  "APIS04_20597702_2021_temp.csv",
  "APIS05_20597703_2021_temp.csv"
)

# Make path to files
fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)

fPaths<- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05"))
fPaths  # This thing is now a vector where each element has a name

## import the site data into a list where each element comes from a different file
intensity_data_raw <- fPaths %>% map( ~ read_csv(file = .x,  skip = 1))

class(intensity_data_raw) # its a list

length(intensity_data_raw) # 5 elements

class(intensity_data_raw[[1]]) # each element is a data.frame (well, tibble actually)

intensity_data_raw %>% map_dbl(~ nrow(.x))

intensity_data_raw %>% map(~ colnames(.x))


library(lubridate)

intensity_data_raw <- intensity_data_raw %>% discard(~ nrow(.x) == 0)

Fix_Data <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}


intensity_data <- intensity_data_raw %>% map(~ Fix_Data(.x))


Summarise_Data <- function(data) {
  data %>%
    group_by(Date) %>%
    summarise(across(where(is.numeric), mean)) 
}

summary_data <- intensity_data %>% map(~ Summarise_Data(.x))
### Oops - some of the intensity data is 0, fix this for later use in gamma glm().

Zero_Intensity_Fix <- function(data) {
  data %>%
    filter(across(any_of("Intensity"), ~ .x != 0))
}

summary_data <- summary_data %>% map(~ Zero_Intensity_Fix(.x))

Summary_Graph <- function(data) {
  data <- data %>%
    pivot_longer(cols = where(is.numeric), names_to = "Measure", values_to = "Value") %>%
    filter(Measure != "Temp_F")

  Graph <- ggplot(data, aes(x = Date, y = Value, group = Measure, color = Measure)) +
    geom_point() +
    facet_grid(Measure ~ ., scales = "free_y")

  return(Graph)
}

Graphs1 <- summary_data %>% map(~ Summary_Graph(.x))


Intensity_Regression <- function(data, ...) {
  glm(Intensity ~ Temp_C, data = data, ...)
}


Gauss_Reg <- summary_data %>%
  map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
    ~ Intensity_Regression(.x, family = gaussian),
    .else = ~ NA
  )


Gamma_Reg <- summary_data %>%
  map_if(~ all(c("Intensity", "Temp_C") %in% colnames(.x)),
    ~ Intensity_Regression(.x, family = Gamma(link="log")),
    .else = ~NA
  )

library(modelr)

summary_data <- map2(.x = summary_data, .y = Gauss_Reg, .f = ~ {
  if ("Intensity" %in% colnames(.x)) {
    add_predictions(data = .x, model = .y, var = "Pred")
  } else {
    .x
  }
})


Pred_Graph_Data<- bind_rows(summary_data, .id="Site") %>% filter(!is.na(Pred))

ggplot(Pred_Graph_Data, aes(x=Date, y=Intensity))+
  geom_point()+
  geom_line(aes(x=Date, y=Pred, color="red"))+
  facet_grid(Site~., scales="free_y")


# A vector of numbers
x <- 1:10

# Arithmetic is vectorized
x + 1
# a character vector (letters)

x <- letters[1:10]

# paste is vectorized
paste("Letter:", x)

# a logical vector

x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# logical operations are vectorized

TRUE & x

# a date vector

library(lubridate)

x <- mdy("02/07/2022", "02/08/2022", "02/09/2022", "02/10/2022")
x
## adding 1 shows you the next day
x + 1

rm(x)

library(tidyverse)
library(lubridate)
library(microbenchmark)

# Recreate intensity data
fNames <- c(
  "APIS01_20548905_2021_temp.csv",
  "APIS02_20549198_2021_temp.csv",
  "APIS03_20557246_2021_temp.csv",
  "APIS04_20597702_2021_temp.csv",
  "APIS05_20597703_2021_temp.csv"
)

fPaths <- paste0("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/", fNames)

intensity_data_raw <- set_names(fPaths, c("APIS01", "APIS02", "APIS03", "APIS04", "APIS05")) %>%
  map(~ read_csv(file = .x, skip = 1)) %>%
  discard(~ nrow(.x) == 0)


# Original function
Fix_Data_Lubridate <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = mdy_hms(Date_Time), Date = date(Date_Time))
}


# New version using as.POSIXct()
Fix_Data_POSIXct <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(
      Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = as.POSIXct(Date_Time, format = "%m/%d/%y %H:%M:%S", tz = "UTC"),
      Date = date(Date_Time)
    )
}

# new version using strptime
Fix_Data_strptime <- function(data) {
  data %>%
    select(starts_with("Date") | starts_with("Temp") | starts_with("Intensity")) %>%
    rename("Date_Time" = starts_with("Date"), "Temp_F" = starts_with("Temp"), "Intensity" = starts_with("Intensity")) %>%
    mutate(
      Temp_C = 5 * (Temp_F - 32) / 9, Date_Time = strptime(Date_Time, format = "%m/%d/%y %H:%M:%S", tz = "UTC"),
      Date = date(Date_Time)
    )
}

# Do Comparison
mb <- microbenchmark(
  Lubridate = intensity_data_raw %>% map(~ Fix_Data_Lubridate(.x)),
  POSIXct = intensity_data_raw %>% map(~ Fix_Data_POSIXct(.x)),
  Strptime = intensity_data_raw %>% map(~ Fix_Data_strptime(.x)),
  times = 10L, unit = "ms"
)

mb

boxplot(mb)
install.packages("rmarkdown") 
#------------------
# R Markdown I
#------------------
knitr::include_graphics("./images/YAML.png")
<h1>First-level header</h1>

<h2>Second-level header</h2>

...

<h6>Sixth-level header</h6>

<i>italic</i>

<b>bold</b>

superscript<sup>2</sup>

endash: &ndash;

Example sentence: <i>Picea rubens</i> is the dominant species in <b>Acadia National Park</b>.
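As an added aside, most of these HTML tags have Pandoc Markdown equivalents that R Markdown renders the same way (the superscript and endash shorthands assume the default Pandoc extensions):

```markdown
# First-level header
## Second-level header

*italic* and **bold**

superscript^2^

endash: --
```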

#### This is how to specify header level 4, which renders as:
numspp <- round(mean(runif(50, 0, 100)), 1)
library(tidyverse)
wetdat <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
wetsum <- wetdat %>% group_by(Site_Name, Year) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop')
devtools::install_github("haozhu233/kableExtra")
library(kableExtra) # for extra kable features
library(knitr) # for kable
wet_kable <- kable(wetsum, format = 'html') %>% # if using pdf, need LaTeX
             kable_styling(full_width = FALSE, bootstrap_options = 'striped') #kableExtra function
wet_kable
# custom kable function that requires data, column names and caption
make_kable <- function(data, colnames = NA, caption = NA){
  kab <- kable(data, format = 'html', col.names = colnames, align = 'c', caption = caption) %>% 
      kable_styling(fixed_thead = TRUE, 
                    bootstrap_options = c('condensed', 'bordered', 'striped'), 
                    full_width = FALSE, 
                    position = 'left', 
                    font_size = 12) %>%
      row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>% 
      row_spec(nrow(data), extra_css = 'border-bottom: 1px solid #000000;')
  return(kab) # return explicitly so the table also prints when the function is called directly
}

# use function with wetsum data
wetkab2 <- make_kable(wetsum, 
                      colnames = c("Site", "Year", "# Invasive", "# Protected"),
                      caption = "Table 1. Summary of wetland data") %>% 
           scroll_box(height = "250px")
wetkab2
wetkab3 <- make_kable(wetsum, 
                      colnames = c("Site", "Year", "# Invasive", "# Protected"),
                      caption = "Table 1. Summary of wetland data") %>% 
           row_spec(0, extra_css = "border-top: 1px solid #000000; border-bottom: 1px solid #000000;") %>% 
           column_spec(3, background = ifelse(wetsum$num_inv > 0, "orange", FALSE)) %>% 
           collapse_rows(1, valign = 'top') 
wetkab3
ans_tab <- kable(wetsum, format = 'html',
                 align = c('l', 'c', 'c', 'c')) %>%  # Answer 1
           kable_styling(fixed_thead = TRUE, 
                 full_width = FALSE, 
                 position = 'left') # Answer 2 

ans_tab
library(DT)
wetdt <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"))
wetdt
# modify pageLength and lengthMenu
wetdt2 <- datatable(wetsum, colnames = c("Site", "Year", "# Invasive", "# Protected"),
                    width = "40%",
                    options = list(pageLength = 10,
                                   lengthMenu = c(5, 10, 20))
                    )

wetdt2

wetdt3 <- datatable(data.frame(wetsum, "Notes" = NA), 
                    width = "40%",
                    colnames = c("Site", "Year", "# Invasive", "# Protected", "Notes"),
                    options = list(pageLength = 10),                        
                    class = 'cell-border stripe',
                    filter = list(position = c('top'), clear = FALSE),
                    editable = list(target = 'cell', disable = list(columns = 1:4))) %>% 
          formatStyle(3, backgroundColor = styleInterval(0, c('white', "orange")))
wetdt3

<img src="./images/map_of_parks.jpg" alt = "Map of Region-1 IMD parks" width="400px">
photos <- list.files("./images/photopoints", full.names = TRUE)
include_graphics(photos)

knitr::opts_chunk$set(fig.height=4, fig.width=6, fig.align='left')
library(dplyr)
library(ggplot2)
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")

invplot_all
invplot_ACAD
# load packages
library(ggplot2)
library(dplyr)
library(kableExtra)

# Import data from github
devtools::source_url("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/Generate_fake_invasive_data_and_plots.R")

table(invdata$park, invdata$cycle) # Should see 3 parks each with 3 cycles

# Create invplot_fun to plot different parks
invplot_fun <- function(data, park){
  p <- ggplot(data %>% filter(park == .env$park), # .env$park is the function argument; bare park is the column
              aes(x = cycle, y = inv_cover)) +
    geom_jitter(color = "#69A466", width = 0.1) +
    geom_boxplot(aes(group = cycle), width = 0.18, lwd = 0.4, alpha = 0) +
    theme_bw() +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank()) +
    labs(x = 'Cycle', y = "Invasive % Cover") +
    scale_x_continuous(limits = c(0.9, 3.1), breaks = c(1, 2, 3))
  return(p)
}

# Plot invasives by cycle for specified park
invplot <- invplot_fun(invdata, park = "ACAD") 

# Filter invasive data to only include specified park and cycle
parkinv <- invdata %>% 
  filter(park == "ACAD" & cycle == 3) %>% 
  select(-park)                    

parktab <- kable(parkinv, format = 'html',
                 col.names = c("Plot", "Cycle", "% Invasive Cover", "% Canopy Cover"),
                 caption = "Table of forest plots in Acadia National Park" 
                 ) %>% 
           kable_styling(full_width = FALSE, bootstrap_options = 'striped') %>% 
           scroll_box(height = "400px", width = "420px")
invplot
parktab
# load packages
library(purrr)
library(rmarkdown)

# Set in and out directories and file to iterate on
indir <- "./Markdown_examples/"
outdir <- "./Markdown_examples/output/" # Included, to see how to change the output dir.
rmd <- "example_for_params.Rmd"

# Set up parameter lists. Have to repeat 3 times because each park has 3 cycles
park_list <- rep(c("ACAD", "MABI", "SARA"), each = 3)
park_long_list <- rep(c("Acadia National Park", 
                    "Marsh-Billings-Rockefeller NHP", 
                    "Saratoga National Historical Park"), each = 3)
cycle_list <- rep(c(1, 2, 3), 3)

# Create the render function that we will iterate with
render_fun <- function(parkcode, parklong, cyclename){
    render(input = paste0(indir, rmd),
         params = list(park = parkcode,
                       park_long = parklong,
                       cycle = cyclename),
         output_file = paste0("Example_for_", 
                              parkcode,
                              "_cycle_", 
                              cyclename, 
                             ".html"),
         output_dir = outdir)
}

# Map render_fun() to the lists. 
# Note that you must have the same # of elements in each list
pmap(list(park_list, park_long_list, cycle_list), 
     ~render_fun(..1, ..2, ..3))

rm(list = ls())
# This opens up a browser window where you can create a personal access token. It may ask you to log into GitHub or confirm your password.
usethis::create_github_token()
gitcreds::gitcreds_set()
# Read in bird checklist
birds <- read.csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")

# Preview dataset
head(birds)

# Load packages
library(dplyr)
library(readr)

# Read in bird checklist
birds <- read_csv("https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/birds.csv")

# Preview dataset
birds

# Verify that all rows (species) are unique
paste("There are", nrow(birds), "rows in this dataset.")
paste("There are", nrow(distinct(birds)), "unique rows in this dataset.")

# How many species groups are there?
paste("There are", n_distinct(birds$species_group), "species groups.")

# Any missing data?
any(is.na(birds$species_group))
any(is.na(birds$primary_com_name))

# Count species in each species group
group_size <- birds %>%
  group_by(species_group) %>%
  summarize(species_count = n()) %>%
  arrange(-species_count) %>%
  ungroup()

# What are the largest species groups?
group_size

# What are the smallest species groups?
tail(group_size)

knitr::opts_chunk$set(echo = TRUE)
#------------------
# R Packages I
#------------------
knitr::include_graphics("./images/gitshell_init_code.png")
usethis::create_package("D:/NETN/R_Dev/testpackage") # update to work with your file path
usethis::use_git() # sets up local git for new package
usethis::use_github() # creates new github repo called testpackage. 
usethis::use_mit_license() # set license to MIT license (or use a different license.)
usethis::git_default_branch_rename() # renames master to main 
#------------------
# R Packages 2
#------------------
knitr::include_graphics("./images/Project_Options_Build_Tools.png")
#' @title hello
#' @description test package for R training
#' @export
knitr::include_graphics("./images/Build_hello_example.png")
?testpackage::hello
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#' @export

make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(data, -genus, -species)
  return(data2)
}

usethis::use_package("dplyr") # for imports which is the default
usethis::use_package("stringr") # for imports which is the default
usethis::use_package("ggplot2", "Suggests") # for suggests
?testpackage::make_sppcode
library(testpackage)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa", 
                                         "Malaxis unifolia", "Calopogon tuberosus"), 
                          cover = c(10, 40, 10, 50),
                          stems = c(50, 20, 10, 10))

example_dat2 <- make_sppcode(example_dat, "Latin_Name")
head(example_dat2)
#' @title make_sppcode
#' @description Make a 6-letter code with first 3 letters of genus and species
#'
#' @importFrom dplyr mutate select
#' @importFrom stringr word
#'
#' @param data Name of data frame that contains at least one column with Latin names
#' @param sppname Quoted name of the column that contains the Latin names
#'
#' @return Returns a data frame with a new column named sppcode.
#'
#' @examples
#' library(testpackage)
#' 
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa", 
#'                                          "Malaxis unifolia", "Calopogon tuberosus"),
#'                           cover = c(10, 40, 10, 50),
#'                           stems = c(50, 20, 10, 10))
#'
#' example_dat2 <- make_sppcode(example_dat, "Latin_Name")
#' head(example_dat2)
#'
#' @export

make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(data, -genus, -species)
  return(data2)
}

#------------------
# R Packages III
#------------------
library(stringr)
library(dplyr)
example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa", 
                                         "Malaxis unifolia", "Calopogon tuberosus"), 
                          cover = c(10, 40, 10, 50),
                          stems = c(50, 20, 10, 10))

#' @title theme_IMD: custom ggplot2 theme for forestNETN
#'
#'
#' @description This is a custom ggplot2 theme that removes the default panel grids
#' from ggplot2 figures, and makes the axes and tick marks grey instead of black.
#'
#' @return This function must be used in conjunction with a ggplot object, and will return a ggplot object with the custom theme.
#'
#' @examples
#' example_dat <- data.frame(Latin_Name = c("Carex limosa", "Arethusa bulbosa",
#'                                          "Malaxis unifolia", "Calopogon tuberosus"),
#'                           cover = c(10, 40, 10, 50),
#'                           stems = c(50, 20, 10, 10))
#' library(ggplot2)
#' p <- ggplot(data = example_dat, aes(x = cover, y = stems)) +
#'      geom_point() +
#'      theme_IMD()
#' p
#'
#' @export


theme_IMD <- function(){

  # Check that suggested package required for this function is installed
  if(!requireNamespace("ggplot2", quietly = TRUE)){
    stop("Package 'ggplot2' needed for this function to work. Please install it.", call. = FALSE)
  }

  ggplot2::theme(panel.grid.major = ggplot2::element_blank(),
                 panel.grid.minor = ggplot2::element_blank(),
                 panel.background = ggplot2::element_rect(color = '#696969', fill = 'white', size = 0.4),
                 plot.background = ggplot2::element_blank(),
                 strip.background = ggplot2::element_rect(color = '#696969', fill = 'grey90', size = 0.4),
                 legend.key = ggplot2::element_blank(),
                 axis.line.x = ggplot2::element_line(color = "#696969", size = 0.4),
                 axis.line.y = ggplot2::element_line(color = "#696969", size = 0.4),
                 axis.ticks = ggplot2::element_line(color = "#696969", size = 0.4)
)}

joinTreeData <- function(park = 'all', from = 2006, to = 2021, QAQC = FALSE, locType = c('VS', 'all'), panels = 1:4,
                         status = c('all', 'active', 'live', 'dead'), 
                         speciesType = c('all', 'native','exotic', 'invasive'),
                         canopyPosition = c("all", "canopy"), dist_m = NA, 
                         eventType = c('complete', 'all'), output = 'short', ...){

  # Match args and classes
  status <- match.arg(status)
  park <- match.arg(park, several.ok = TRUE,
                    c("all", "ACAD", "MABI", "MIMA", "MORR", "ROVA", "SAGA", "SARA", "WEFA"))
  stopifnot(class(from) == "numeric", from >= 2006)
  stopifnot(class(to) == "numeric", to >= 2006)
  locType <- match.arg(locType)
  stopifnot(class(QAQC) == 'logical')
  stopifnot(panels %in% c(1, 2, 3, 4))
  output <- match.arg(output, c("short", "verbose"))
  canopyPosition <- match.arg(canopyPosition)
  speciesType <- match.arg(speciesType)
  
  # Check for tree data in global environment
  tryCatch(tree_vw <- COMN_TreesByEvent,
           error = function(e){stop("COMN_TreesByEvent view not found. Please import view.")}
  )

  }
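As a minimal standalone sketch of the `match.arg()` pattern used above (the `pick_status` function name is invented for illustration):

```r
# When the argument is left at its default choices vector, match.arg() returns
# the first choice; otherwise it validates the supplied value against the choices.
pick_status <- function(status = c('all', 'active', 'live', 'dead')){
  status <- match.arg(status)
  status
}

pick_status()       # "all"
pick_status("live") # "live"
```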
# dependencies
library(stringr)
library(dplyr)

#function args
data <- example_dat
sppname <- "Latin_Name"

#make_sppcode <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(dat, -genus, -species)
#  return(data2)
#}
# Version with an intentional bug ('dat' instead of 'data') to demonstrate debugging
make_sppcode_error <- function(data, sppname){
  data$genus = word(data[,sppname], 1)
  data$species = ifelse(is.na(word(data[,sppname], 2)), "spp.", word(data[,sppname], 2))
  data <- mutate(data, sppcode = toupper(paste0(substr(genus, 1, 3),
                                                substr(species, 1, 3))))
  data2 <- select(dat, -genus, -species)
  return(data2)
}
make_sppcode_error(example_dat, sppname = "Latin_Name")

traceback()
debug(make_sppcode_error)
make_sppcode_error(data, "Latin_Name")
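One small addition for completeness: the `debug()` flag persists until cleared, which base R lets you check and undo (throwaway function `f` used for illustration):

```r
f <- function(x) x + 1 # throwaway example function

debug(f)      # flag f so its next call opens the interactive browser
isdebugged(f) # TRUE while the flag is set
undebug(f)    # clear the flag so f runs normally again
isdebugged(f) # FALSE
```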

knitr::include_graphics("./images/package_commit.jpg")
devtools::install_github("KateMMiller/testpackage")

Meet the Instructors

We designed and led this week's training. We hope you found it useful, and that you keep at it!

Andrew Birch
WRD/IMD Water Quality Program Lead



Ellen Cheng
SER Quantitative Ecologist



Kate Miller
NETN/MIDN Quantitative Ecologist



Lauren Pandori
CABR Marine Biologist - MEDN



Thomas Parr
GLKN Program Manager



John Paul Schmit
NCRN Quantitative Ecologist



Sarah Wright
MOJN Data Scientist